Abstract

The persistent mutual information (PMI) is a complexity measure for stochastic
processes. It is related to well-known complexity measures like excess entropy or
statistical complexity. Essentially it is a variation of the excess entropy so that it
can be interpreted as a specific measure of system internal memory. The PMI was
first introduced in 2010 by Ball, Diakonova and MacKay as a measure for (strong)
emergence [Bal10]. In this paper we define the PMI mathematically and investigate
the relation to excess entropy and statistical complexity. In particular we prove
that the excess entropy is an upper bound of the PMI. Furthermore we show some
properties of the PMI and calculate it explicitly for some example processes. We
also discuss to what extent it is a measure for emergence and compare it with
alternative approaches used to formalize emergence.

1 Preliminaries 

Let $(\Omega, \mathcal{F}, P)$ be a probability space with a metric space $\Omega$, a $\sigma$-algebra $\mathcal{F}$ and a probability
measure $P$. For random variables $X, Y : \Omega \to \mathcal{A}$ mapping to a finite alphabet $\mathcal{A}$ the
Shannon entropy is defined by

$$H(X) := -\sum_{x \in \mathcal{A}} \Pr(X = x) \log \Pr(X = x)$$

and the conditional Shannon entropy by

$$H(X \mid Y) := -\sum_{x, y \in \mathcal{A}} \Pr(X = x, Y = y) \log \Pr(X = x \mid Y = y),$$

where $\Pr(X = x) := P(\{\omega \in \Omega \mid X(\omega) = x\})$ denotes the probability that the random
variable $X$ is equal to $x \in \mathcal{A}$, $\Pr(X = x, Y = y)$ is the joint probability of $X$ and $Y$,
and for $\Pr(Y = y) > 0$ the conditional probability is $\Pr(X = x \mid Y = y) := \Pr(X = x, Y = y) / \Pr(Y = y)$.
In the definitions the convention $0 \log(0) = 0$ is used. The mutual information between
two random variables is

$$I(X; Y) := H(X) - H(X \mid Y).$$

The mutual information is non-negative ($I(X; Y) \geq 0$) and equals zero if and only if $X$
and $Y$ are independent random variables [Cov06].
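The definitions above can be checked numerically on small examples. The following is a minimal sketch, assuming base-2 logarithms (bits) and a joint distribution stored as a dictionary; the function names and the two toy distributions are illustrative assumptions and do not appear in the paper.

```python
import math
from collections import defaultdict

def entropy(pmf):
    """Shannon entropy in bits of a dict {value: probability}; 0*log(0) := 0."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginals(joint):
    """Marginal distributions of X and Y from a joint pmf {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)

def conditional_entropy(joint):
    """H(X | Y) = -sum_{x,y} Pr(x, y) log Pr(x | y), with Pr(x | y) = Pr(x, y) / Pr(y)."""
    _, py = marginals(joint)
    return -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) - H(X | Y)."""
    px, _ = marginals(joint)
    return entropy(px) - conditional_entropy(joint)

# Two toy cases: independent fair bits, and Y an exact copy of a fair bit X.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
copy = {(0, 0): 0.5, (1, 1): 0.5}
```

For the independent pair the mutual information vanishes, while for the copied bit it equals the full entropy of one bit, in line with the characterisation of independence stated above.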

We consider a time-discrete stationary stochastic process $S := (S_t)_{t \in \mathbb{Z}}$ with random
variables $S_t : \Omega \to \mathcal{A}$ for times $t \in \mathbb{Z}$. We define the semi-infinite processes $\overleftarrow{S} := (S_{-t})_{t \in \mathbb{N}}$,
interpreted as past, and $\overrightarrow{S} := (S_t)_{t \in \mathbb{N}_0}$, interpreted as future. Blocks of random
variables with finite length are denoted by $S_a^b := (S_k)_{k \in [a,b] \cap \mathbb{Z}}$ for $-\infty < a \leq b < \infty$ and
the corresponding block entropy is $H(L) := H(S_1^L) = H(S_1, \dots, S_L)$. The one-sided
sequence space is $\mathcal{A}^{\mathbb{N}} := \times_{j \in \mathbb{N}} \mathcal{A}$ and in the same way the two-sided sequence space $\mathcal{A}^{\mathbb{Z}}$ is
defined. We introduce the shift function $\sigma : \mathcal{A}^{\mathbb{Z}} \to \mathcal{A}^{\mathbb{Z}}$ by $\sigma(x)_i := x_{i+1}$. At any time $t \in \mathbb{Z}$
we have random variables $S_{-\infty}^{t} := (S_k)_{k \leq t}$ and $S_{t+1}^{\infty} := (S_k)_{k \geq t+1}$ that govern the system's
observed behaviour in the shifted past and the shifted future, respectively. The mutual
information between these two variables is the well-known excess entropy [Cru83, Cru03]

$$E := \lim_{L \to \infty} I(S_0^{L-1}; S_{-L}^{-1}). \qquad (1.1)$$

In general, it is not clear if the limit in (1.1) exists (for Markov processes of finite order
one can prove the existence). With the assumption that the limit in (1.1) exists as a finite
number the following equality holds: $E = I(\overleftarrow{S}; \overrightarrow{S})$, see Chapter 2.2 in [Pin64].

2 Conceptualization 

The definition of the excess entropy (1.1) allows a concrete information-theoretic
interpretation. In particular the excess entropy can be seen as a specific measure of system
internal memory. We will take this as a basis to define a new term, first suggested in
[Bal10], which will capture the structural behaviour of a dynamical system on the whole
time domain. In particular it should be possible to detect any existing inherent structure
of the system which survives for all times. In order to achieve this goal we adapt the
mutual information-based representation of the excess entropy and introduce the following
expression

$$E^L_{t,\tau} := I(S_t^{t+L-1}; S_{-\tau-L+1}^{-\tau}).$$

For $t = 0$ and $\tau = 1$ this expression coincides with the finite-length excess entropy and we
have

$$E = \lim_{L \to \infty} E^L_{0,1}.$$

For arbitrary $t$ and $\tau$ values we get a family of similar terms

$$E_{t,\tau} := \lim_{L \to \infty} E^L_{t,\tau}.$$

Every expression $E_{t,\tau}$ is the excess entropy with a time-gap of size $|t - \tau|$ between a random
variable block of the past and the future. For stationary processes we can write $E^L_{t,\tau}$ as

$$E^L_{t,\tau} = I(S_0^{L-1}; S_{-\tau-t-L+1}^{-\tau-t}) = I(S_0^{L-1}; S_{-k-L+1}^{-k}),$$

with $k := \tau + t$. Instead of $E^L_{t,\tau}$ we often write $E^L_k$.

To ensure that the double sequence $(E^L_k)_{k,L \in \mathbb{N}}$ converges to $E$ (written as $\lim_{k,L \to \infty} E^L_k$), for
every $\varepsilon > 0$ two numbers $m, n \in \mathbb{N}$ need to exist so that for all $k > m$, $L > n$ it holds
that $|E^L_k - E| < \varepsilon$. A simple sequence in the double sequence $E^L_k$ is defined with two
subsequences $L_i \to \infty$ and $k_i \to \infty$ by $(E^{L_i}_{k_i})_{i \in \mathbb{N}}$. The double sequence $E^L_k$ converges to
$E$ if and only if all simple sequences in $E^L_k$ converge to $E$ [Lon00]. In particular it holds
that

$$\lim_{k \to \infty} \lim_{L \to \infty} E^L_k = \lim_{L \to \infty} \lim_{k \to \infty} E^L_k.$$

The reverse direction of the last conclusion does not hold.

2.1 Definition Let a stochastic process with values in a finite alphabet $\mathcal{A}$ be given. The
persistent mutual information of such a process is defined by

$$PMI := \lim_{k,L \to \infty} E^L_k.$$

If the PMI exists, it is enough to consider the iterated limits

$$PMI = \lim_{L \to \infty} \lim_{k \to \infty} E^L_k = \lim_{k \to \infty} \lim_{L \to \infty} E^L_k.$$

In the following we want to investigate this expression, which was first proposed by Ball
and collaborators in [Bal10]. For stationary processes we can write the persistent mutual
information (if it exists) as

$$PMI = \lim_{L \to \infty} \Bigl( H(L) - \lim_{k \to \infty} H(S_0^{L-1} \mid S_{-k-L+1}^{-k}) \Bigr)$$
$$= \lim_{L \to \infty} \Bigl( 2H(L) - \lim_{k \to \infty} H(S_0^{L-1}, S_{-k-L+1}^{-k}) \Bigr) \qquad (2.1)$$
$$= \lim_{L \to \infty} \Bigl( H(L) - \lim_{k \to \infty} H(S_{k-1}^{k+L-2} \mid S_{-L}^{-1}) \Bigr)$$
$$= \lim_{L \to \infty} \Bigl( 2H(L) - \lim_{k \to \infty} H(S_{k-1}^{k+L-2}, S_{-L}^{-1}) \Bigr).$$

The last identities follow from the chain rule for the conditional entropy and the
stationarity of the process. Note that since the PMI is assumed to exist it is possible to exchange
the limits.
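For intuition, the finite quantity $E^L_k = I(S_0^{L-1}; S_{-k-L+1}^{-k})$ can be estimated from a single long sample path by counting block pairs. The sketch below is only a plug-in estimator under the assumption that empirical block frequencies approximate the true probabilities (it is biased for short samples); the names `plugin_entropy` and `block_mi` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Plug-in Shannon entropy in bits of a list of hashable observations."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def block_mi(seq, L, k):
    """Plug-in estimate of I(S_0^{L-1}; S_{-k-L+1}^{-k}) from one sample path.

    For every admissible position i the 'future' block seq[i:i+L] is paired
    with the 'past' block of length L whose last symbol lies k steps before i.
    """
    futures, pasts, pairs = [], [], []
    for i in range(k + L - 1, len(seq) - L + 1):
        fut = tuple(seq[i:i + L])
        past = tuple(seq[i - k - L + 1:i - k + 1])
        futures.append(fut)
        pasts.append(past)
        pairs.append((past, fut))
    # I(X; Y) = H(X) + H(Y) - H(X, Y)
    return plugin_entropy(futures) + plugin_entropy(pasts) - plugin_entropy(pairs)

# Toy inputs: a period-3 sequence and a constant sequence.
periodic = [0, 0, 1] * 200
constant = [0] * 100
```

On the period-3 sample the estimate stays close to $\log_2 3$ for any gap $k$, whereas it is zero for the constant sequence; evaluating `block_mi` for growing `k` and `L` gives a rough numerical picture of the double limit, not a proof of its existence.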

We now want to find a reasonable definition of persistent mutual information for one-sided
processes. A one-sided process is a stochastic process with indices consisting only of
positive or negative numbers, e.g. $\mathbb{N}, \mathbb{Z}_+, \mathbb{R}_+$ or $\mathbb{Z}_-, \mathbb{R}_-$. In order to achieve this we
consider the excess entropy of such a process. Because of the stationarity of the process
we can write the excess entropy as

$$E = \lim_{L \to \infty} I(S_0^{L-1}; S_{-L}^{-1}) = \lim_{L \to \infty} I(S_L^{2L-1}; S_0^{L-1}).$$

Hence we obtain

$$E^L_{t,\tau} = I(S_t^{t+L-1}; S_{-\tau-L+1}^{-\tau}) = I(S_{t+L}^{t+2L-1}; S_{-\tau+1}^{-\tau+L}).$$

Because of the definition of one-sided processes we set $\tau = 1$ and obtain the following
definition.

2.2 Definition Let a one-sided stochastic process with values in a finite alphabet $\mathcal{A}$ be
given. The persistent mutual information of such a process is defined by

$$PMI := \lim_{t,L \to \infty} I(S_{t+L}^{t+2L-1}; S_0^{L-1}).$$

If PMI exists it is enough to consider the iterated limits

$$PMI = \lim_{L \to \infty} \lim_{t \to \infty} I(S_{t+L}^{t+2L-1}; S_0^{L-1}) = \lim_{t \to \infty} \lim_{L \to \infty} I(S_{t+L}^{t+2L-1}; S_0^{L-1}).$$

In contrast to two-sided processes, the reference point (which can be interpreted as the present)
moves to infinity. As in the two-sided case we obtain simpler expressions for stationary
processes

$$PMI = \lim_{L \to \infty} \Bigl( H(L) - \lim_{t \to \infty} H(S_{t+L}^{t+2L-1} \mid S_0^{L-1}) \Bigr) = \lim_{L \to \infty} \Bigl( 2H(L) - \lim_{t \to \infty} H(S_{t+L}^{t+2L-1}, S_0^{L-1}) \Bigr),$$

where it is again allowed to exchange the limits if PMI exists.

2.3 Remark Note that both PMI expressions are also defined for nonstationary
stochastic processes. In this paper we only consider stationary processes. In any case
the existence of PMI is a priori not clear. Nevertheless we can show that it exists for
Markov processes of finite order or for periodic processes (see Section 6).



3 Necessary Conditions for Existence 

From the definition of PMI it is not clear if the limits exist. In this section we assume
that the double sequence $(E^L_k)_{k,L \in \mathbb{N}}$ converges and hence the PMI exists. We investigate
some necessary conditions for the existence of PMI; to be precise, we investigate what
can be deduced from the existence of the iterated limits

$$\lim_{L \to \infty} \lim_{k \to \infty} E^L_k = \lim_{L \to \infty} \Bigl( 2H(L) - \lim_{k \to \infty} H(S_0^{L-1}, S_{-k-L+1}^{-k}) \Bigr) \qquad (3.1)$$

for a corresponding stochastic process.

We consider two-sided stationary processes and consider the inner limit of (3.1)

$$\lim_{k \to \infty} H(S_0^{L-1}, S_{-k-L+1}^{-k}) = \lim_{k \to \infty} \Bigl( -\sum_{\sigma, \xi \in \mathcal{A}^L} \Pr(S_0^{L-1} = \sigma, S_{-k-L+1}^{-k} = \xi) \log \Pr(S_0^{L-1} = \sigma, S_{-k-L+1}^{-k} = \xi) \Bigr).$$

If this limit exists then the limit of the induced probability distribution also exists

$$\lim_{k \to \infty} \Pr(S_0^{L-1} = \sigma, S_{-k-L+1}^{-k} = \xi) = \lim_{k \to \infty} P(\{\omega \in \Omega : S_0^{L-1}(\omega) = \sigma, S_{-k-L+1}^{-k}(\omega) = \xi\}),$$

with $\sigma, \xi \in \mathcal{A}^L$. This is a limit in the space of all probability distributions on $\mathcal{A}^L$ which
we denote as $\mathcal{M}(\mathcal{A}^L)$. We introduce a topology on $\mathcal{M}(M)$. With $C(M)$ we denote the
space of all continuous functions $f : M \to \mathbb{R}$.
3.1 Definition The weak* topology on $\mathcal{M}(M)$ is the smallest topology such that for
$\mu \in \mathcal{M}(M)$ and $f \in C(M)$ every map $\mathcal{M}(M) \to \mathbb{C}$ with $\mu \mapsto \int_M f \, d\mu$ is continuous. A
basis is given by

$$V_\mu(f_1, \dots, f_k; \varepsilon) = \Bigl\{ \nu \in \mathcal{M}(M) : \Bigl| \int f_i \, d\nu - \int f_i \, d\mu \Bigr| < \varepsilon, \ 1 \leq i \leq k \Bigr\}$$

with $\mu \in \mathcal{M}(M)$, $k \geq 1$, $f_i \in C(M)$ and $\varepsilon > 0$.

With this definition we can understand the limit above as a weak limit with respect to
this topology.

3.2 Definition ([Bil68]) A sequence $(P_n)_{n \in \mathbb{N}}$ in $\mathcal{M}(M)$ converges weak* to $P \in \mathcal{M}(M)$
if for all $f \in C(M)$ it holds that

$$\int_M f \, dP_n \to \int_M f \, dP \quad \text{as } n \to \infty.$$

The Portmanteau Theorem gives a series of equivalent characterizations of the weak*
convergence.

3.3 Proposition (Portmanteau Theorem) Let $P_n, P \in \mathcal{M}(M)$ and $(M, \mathcal{F}, P_n)$,
$(M, \mathcal{F}, P)$ be probability spaces for $n \in \mathbb{N}$. Then the following are equivalent:

(i) $P_n$ is weak* convergent to $P$.

(ii) $\lim_{n \to \infty} \int_M f \, dP_n = \int_M f \, dP$ for all $f \in C(M)$.

(iii) $\limsup_{n \to \infty} P_n(F) \leq P(F)$ for all closed sets $F \in \mathcal{F}$.

(iv) $\liminf_{n \to \infty} P_n(G) \geq P(G)$ for all open sets $G \in \mathcal{F}$.

(v) $\lim_{n \to \infty} P_n(A) = P(A)$ for all sets $A \in \mathcal{F}$ with $P(\partial A) = 0$, where the boundary of $A$ is
denoted as $\partial A$.

Proof.

See [Bil68] Chapter 1.2. □

In particular the last equivalence shows that the weak* convergence can be understood
as pointwise convergence in our case, since for all sets $\sigma \in \mathcal{A}^L$ it holds that $\partial \sigma = \emptyset$. A first
answer to the question when the distributions $\Pr(S_{-k-L+1}^{-k})$ have a limit with respect to
weak* convergence is given by the following proposition, which is a version of Proposition 5.5 in
[Bil68].

3.4 Proposition Let $P \in \mathcal{M}(M)$, furthermore let $S_k : \Omega \to \mathcal{A}$ be a sequence of
measurable mappings which converge pointwise to a mapping $S : \Omega \to \mathcal{A}$ $P$-almost everywhere.
Then it holds that

$$P(S_k^{-1}) \to P(S^{-1}) \quad \text{as } k \to \infty,$$

w.r.t. the weak* topology.
Proof.

We show item (iv) in the Portmanteau Theorem. To do that we define for an open set
$G \subset \mathcal{A}$

$$T_k := \bigcap_{i \geq k} S_i^{-1}(G) = \bigcap_{i \geq k} \{\omega : S_i(\omega) \in G\}.$$

Let $N$ be a set of measure zero w.r.t. $P$, containing those points for which $S_k$ does not
converge pointwise to $S$. It holds that

$$S^{-1}(G) \subset \bigcup_k T_k \cup N.$$

Because of $P(N) = 0$ it follows that $P(S^{-1}(G)) \leq P(\bigcup_k T_k)$. Furthermore $T_k \subset T_{k+1}$.
For $\varepsilon > 0$ and $k$ chosen large enough we obtain

$$P(S^{-1}(G)) \leq P(T_k) + \varepsilon.$$

With $T_k \subset S_k^{-1}(G)$ it follows that $P(T_k) \leq P(S_k^{-1}(G))$. Putting things together we obtain

$$P(S^{-1}(G)) \leq P(T_k) + \varepsilon \leq P(S_k^{-1}(G)) + \varepsilon.$$

Since $\varepsilon$ is arbitrary and the left-hand side does not depend on $k$ we get

$$P(S^{-1}(G)) \leq \liminf_{k \to \infty} P(S_k^{-1}(G)).$$

□



3.5 Remark With the same argument as in the proof above one can show, under the same
assumptions, the convergence of a finite-length block of random variables

$$P(S_k^{-1}, S_{k+1}^{-1}, \dots, S_{k+L-1}^{-1}) \to P(S^{-1})$$

w.r.t. the weak* topology. Furthermore one can extend the result to joint distributions of
different random variables

$$P(S_1^{-1}, \dots, S_L^{-1}, S_k^{-1}, \dots, S_{k+L-1}^{-1}) \to P(S_1^{-1}, \dots, S_L^{-1}, S^{-1}),$$

w.r.t. the weak* topology.

To fulfill the assumptions of the last proposition, the random variables of a stochastic
process need to converge almost everywhere. If the set of all points $\omega \in \Omega$ for which

$$S_{-k}(\omega) \to S(\omega) \quad \text{as } k \to \infty \qquad (3.2)$$

does not hold is a set of measure zero w.r.t. $P$, then the limit of the distributions exists

$$\lim_{k \to \infty} \Pr(S_0^{L-1} = \sigma, S_{-k-L+1}^{-k} = \xi) = \Pr(S_0^{L-1} = \sigma, S = \xi).$$

Hence the following limit exists

$$\lim_{k \to \infty} H(S_0^{L-1}, S_{-k-L+1}^{-k}) = H(S_0^{L-1}, S).$$

With that we have shown the following proposition.
3.6 Proposition Assume that PMI exists for a stationary stochastic process and
the process fulfills the convergence condition (3.2) $P$-a.e., then the PMI is the mutual
information-version of the "excess entropy" of the following stochastic process

$$\tilde{S} = (\dots, S, \dots, S, S_0, \dots, S_{L-1}, \dots), \qquad (3.3)$$

to be precise $(\tilde{S}_t)_{t \in \mathbb{Z}}$ with $\tilde{S}_t := S$ for all $t < 0$ and $\tilde{S}_t := S_t$ for $t \geq 0$. In general $(\tilde{S}_t)_{t \in \mathbb{Z}}$
is not stationary and it holds that

$$PMI = \lim_{L \to \infty} \bigl( 2H(L) - H(S_0^{L-1}, S) \bigr) = \lim_{L \to \infty} I(S_0^{L-1}; S) = I(\overrightarrow{S}; S).$$

If the process $S$ is a one-sided stationary process and if we assume that $S_t(\omega) \to S(\omega)$
$P$-a.e., then the same result holds

$$PMI = \lim_{L \to \infty} \bigl( 2H(L) - H(S, S_0^{L-1}) \bigr) = \lim_{L \to \infty} I(S; S_0^{L-1}) = I(S; \overrightarrow{S}). \qquad □$$

3.7 Remark If the process (3.3) in Proposition 3.6 is stationary it holds for two-sided 
and one-sided processes that PMI = H(S). 

The PMI is the mutual-information based version of the excess entropy of a process with 
constant past (and with constant future in the one-sided case). In the original process 
this constant past is located very far in the past. The PMI can thus be understood as 
the amount of information which is communicated from a very far past to the future. 
In this sense the PMI represents a kind of memory which is permanently stored in the 
process for all times. Thus the PMI can be considered as an inherent measure of the 
system complexity. 

3.8 Remark Note that because of

$$0 \leq H(L) \leq L \log(|\mathcal{A}|)$$

and

$$0 \leq H(S_t^{t+L-1}, S_{-\tau-L+1}^{-\tau}) \leq 2L \log(|\mathcal{A}|),$$

the convergence at $t, \tau \to \infty$ leads to a unique limit or to a certain number of accumulation
points. If the expression $H(S_t^{t+L-1}, S_{-\tau-L+1}^{-\tau})$ is monotonically increasing or decreasing w.r.t.
$t$ or $\tau$, a limit exists. Proposition 3.6 explains the PMI only for the special class of
processes for which (3.1) exists.



4 Relation to Statistical Complexity 

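We now pick up the ideas sketched in [Bal10] to express the PMI with so-called causal
states. In particular one can show that the statistical complexity (internal entropy of the
causal states) is an upper bound for the PMI. In the rest of this section we assume
that the PMI exists. We start by introducing time-indexed causal states. We consider
shifted blocks of random variables

$$\overleftarrow{S}_\tau := (S_{-\tau-t})_{t \in \mathbb{N}}, \qquad \overrightarrow{S}_\tau := (S_{\tau+t})_{t \in \mathbb{N}_0},$$

for $\tau \in \mathbb{N}_0$. The sets of realisations (see footnote 1) are denoted by $\overleftarrow{\mathbf{S}}_\tau \subset \mathcal{A}^{\mathbb{N}}$ and
$\overrightarrow{\mathbf{S}}_\tau \subset \mathcal{A}^{\mathbb{N}_0}$, and the sub-$\sigma$-algebras
which are generated by cylinder sets are denoted by $\mathcal{C}_{\tau,-\mathbb{N}} \subset \mathcal{C}_{-\mathbb{N}}$ and $\mathcal{C}_{\tau,\mathbb{N}} \subset \mathcal{C}_{\mathbb{N}}$.
On the set $\mathcal{A}^{\mathbb{N}}$ of all shifted past trajectories of the process $S$ we define an equivalence
relation

$$\overleftarrow{s} \sim_\tau \overleftarrow{s}' \ :\Leftrightarrow \ \Pr(\overrightarrow{S} = \overrightarrow{s} \mid \overleftarrow{S}_\tau = \overleftarrow{s}) = \Pr(\overrightarrow{S} = \overrightarrow{s} \mid \overleftarrow{S}_\tau = \overleftarrow{s}'), \quad \overrightarrow{s} \in \mathcal{C}_{\mathbb{N}},$$

where $\overleftarrow{s}, \overleftarrow{s}' \in \overleftarrow{\mathbf{S}}_\tau$ and $\Pr(\overrightarrow{S} = \overrightarrow{s} \mid \overleftarrow{S}_\tau = \overleftarrow{s})$ is a regular version of the conditional
expectation. The equivalence classes

$$S^+_\tau(\overleftarrow{s}) := \{\overleftarrow{s}' \in \overleftarrow{\mathbf{S}}_\tau : \overleftarrow{s}' \sim_\tau \overleftarrow{s}\} \subset \overleftarrow{\mathbf{S}}_\tau$$

of this relation are called shifted causal states. The set of all shifted causal states is
denoted by $\mathcal{S}^+_\tau := \{S^+_\tau(\overleftarrow{s}) \mid \overleftarrow{s} \in \mathcal{A}^{\mathbb{N}}\}$.

In the same sense we define (future) shifted causal states $S^-_\tau(\overrightarrow{s})$ and $\mathcal{S}^-_\tau$ (we only have
to exchange the roles of past and future trajectories $\overleftarrow{s}$ and $\overrightarrow{s}$).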

We only consider stationary stochastic processes with a finite set of shifted causal
states $\mathcal{S}^+_\tau = \{S^+_1, \dots, S^+_n\}$ and $\mathcal{S}^-_\tau = \{S^-_1, \dots, S^-_m\}$. Given a past observation of infinite
length $s_{-\infty}^t \in \mathcal{A}^{\mathbb{Z}}$ at time $t \in \mathbb{Z}$, using stationarity we identify this shifted past with
a shifted causal state $S^+_\tau(\sigma^{-t-\tau-1}(s_{-\infty}^t)) \in \mathcal{S}^+_\tau$. Together with the next symbol $s_{t+1}$
generated by the process the next shifted causal state $S^+_\tau(\sigma^{-t-\tau-2}(s_{-\infty}^t s_{t+1})) \in \mathcal{S}^+_\tau$ is
uniquely determined and the shifted causal states are Markov [Sha01, Loe10]. We define
the Markov kernels between two shifted causal states $S^+_i, S^+_j \in \mathcal{S}^+_\tau$ emitting an output
symbol $r \in \mathcal{A}$ for any $t \in \mathbb{Z}$ as follows

$$T^{(r)}_{ij} := T(S^+_i)(S^+_j, r) = \Pr\bigl( S(\sigma^{-t-\tau-2}(s_{-\infty}^t s_{t+1})) = S^+_j \text{ and } S_{t+1} = r \mid S(\sigma^{-t-\tau-1}(s_{-\infty}^t)) = S^+_i \bigr).$$

The set of transition matrices is denoted by $T^+ := \{ (T^{(r)}_{ij})_{i,j} : r \in \mathcal{A} \}$. The
probability of a shifted causal state $S^+_i \in \mathcal{S}^+_\tau$ is denoted by $p^+_i := P(S^+_i)$. The ordered pair
$M^+ := (T^+, (p^+_1, \dots, p^+_n))$ is called shifted (past) $\varepsilon$-machine. In the same way we can
define a shifted (future) $\varepsilon$-machine $M^- := (T^-, (p^-_1, \dots, p^-_m))$.
The shifted $\varepsilon$-machines have internal state entropies

$$C^+_{P,\tau} := H(S^+_\tau) = -\sum_{i=1}^{n} p^+_i \log p^+_i$$

and

$$C^-_{P,\tau} := H(S^-_\tau) = -\sum_{j=1}^{m} p^-_j \log p^-_j,$$

which are also known as (shifted) statistical complexities [Gra86, Sha01].
We can write the PMI as follows.

1 For every $\omega \in \Omega$ the mapping $R_\omega : t \mapsto S_{-t}(\omega)$ is called a realisation of the process $\overleftarrow{S}$. The set of all
realisations is defined as $\overleftarrow{\mathbf{S}} := \{(R_\omega(t))_{t \in \mathbb{N}} : \omega \in \Omega\}$.
4.1 Proposition Assume that PMI exists for a stationary stochastic process, then it
holds that

$$PMI = \lim_{\tau \to \infty} I(\overleftarrow{S}; \overrightarrow{S}_\tau).$$

Proof.

Since PMI exists we can exchange the limits and write them as

$$PMI = \lim_{t \to \infty} \lim_{L \to \infty} I(S_{-L}^{-1}; S_{t-1}^{t+L-2}). \qquad (4.1)$$

We can decompose the limits in (4.1) into two independent limits and get with Proposition
A.1 (iii) applied two times

$$PMI = \lim_{t \to \infty} \lim_{L \to \infty} I(S_{-L}^{-1}; S_{t-1}^{t+L-2}) = \lim_{\tau \to \infty} I(\overleftarrow{S}; \overrightarrow{S}_\tau).$$

□

Similar to the fact that the excess entropy can be expressed via causal states (see
Proposition B.2 and [Ell09]), we can also express the PMI via shifted causal states.

4.2 Proposition Assume the PMI of a stationary stochastic process exists. Then we
can write the PMI as

$$PMI = \lim_{\tau \to \infty} I(S^+; S^-_\tau) = \lim_{\tau \to \infty} I(S^+_\tau; S^-).$$

Proof.

We take $\overrightarrow{S}_\tau$ instead of $\overrightarrow{S}$ and $S^+, S^-_\tau$ instead of $S^+, S^-$; then the proof is, with Proposition
4.1, analogous to the proof of Proposition B.2. We obtain the second equality with the
symmetry of the mutual information and the stationarity of the stochastic process. □

With this expression we get the following inequalities. 

4.3 Corollary The statistical complexity is an upper bound for the PMI, if it exists,

$$PMI \leq C^-_P, \qquad PMI \leq C^+_P,$$

with equality if and only if $\lim_{\tau \to \infty} H(S^- \mid S^+_\tau) = 0$ or $\lim_{\tau \to \infty} H(S^+ \mid S^-_\tau) = 0$.

Proof.

With Proposition 4.2, the stationarity and the definition of the statistical complexity we
get

$$PMI = H(S^+) - \lim_{\tau \to \infty} H(S^+ \mid S^-_\tau) \leq C^+_P.$$

With the symmetry of the mutual information we get $PMI \leq C^-_P$. □

4.4 Corollary It holds that

$$C^+_P = PMI + \lim_{\tau \to \infty} H(S^+ \mid S^-_\tau), \qquad C^-_P = PMI + \lim_{\tau \to \infty} H(S^- \mid S^+_\tau),$$

furthermore we have the following inequalities

$$C^+_P \geq \lim_{\tau \to \infty} H(S^+ \mid S^-_\tau), \qquad C^-_P \geq \lim_{\tau \to \infty} H(S^- \mid S^+_\tau).$$

Proof.

With Proposition 4.2, the definition of statistical complexity and the symmetry of mutual
information the equalities follow. Because of $PMI \geq 0$ we obtain the inequalities. □

4.5 Remark If $PMI = 0$, then we get with Corollary 4.4 that

$$C^+_P = H(S^+) = \lim_{\tau \to \infty} H(S^+ \mid S^-_\tau).$$

This is the case if and only if $\lim_{\tau \to \infty} S^-_\tau$ and $S^+$ are stochastically independent. This means
that the causal states in the very far past are independent of the causal states in the
future.



5 Relation to Excess Entropy 

In this section we want to find relations between PMI and excess entropy. One might
expect that the persistent mutual information coincides with the excess entropy as soon as
the structure of the past coincides with the structure of the future. The next proposition
shows that this is indeed the case for processes with zero metric entropy. The metric
entropy is defined as the limit $h_P := \lim_{L \to \infty} H(L)/L$ and exists for all stationary
processes.

5.1 Proposition Assume that the excess entropy $E$ and the PMI exist for a stationary
stochastic process, then it holds that

PMI = E <=^> hp = lim H(S + \S~) = and H(S + \S~) = 0. 

T— S-OO 

Proof. 

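We prove the first equivalence.

$\Rightarrow$: Since $E$ is finite we get with Proposition B.1 that $H(L) \sim E + L h_P$ as $L \to \infty$.
Furthermore we get

$$PMI = \lim_{L \to \infty} \Bigl( 2H(L) - \lim_{k \to \infty} H(S_0^{L-1}, S_{-k-L+1}^{-k}) \Bigr) \qquad (5.1)$$
$$= \lim_{L \to \infty} \Bigl( 2E + 2L h_P - \lim_{k \to \infty} H(S_0^{L-1}, S_{-k-L+1}^{-k}) \Bigr). \qquad (5.2)$$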

Hence we get with PMI = E 

lim ( lim H(S^\ SZk- L+1 ) ~ 2Lh P ) = 2E-PMI — E — lim (H{L) - HiS^lSzl))- 

\im k _ 00 H(S^- 1 ,SZ^ L+1 ) H(L) 

Since 1 < tttti an d j—, r < 1, it follows that 

H{L) ]im k ^ 00 H(St\SZ k k - L+1 )- 



hm H(St\ SZ k k _ L+ i) ~ H(L) as L ^ oo, 



which leads with (5.1) to 



PMI = lim H(L) = E. 

L— s-oo 
Finally this implies because of Proposition B.1 that $h_P = 0$.

$\Leftarrow$: Due to $h_P = 0$ it holds that $E = \lim_{L \to \infty} H(L)$. Furthermore it follows that

$$H(2L + k) \geq H(S_0^{L-1}, S_{-k-L+1}^{-k}) \geq H(L),$$

and using $H(2L + k) \to E$ and $H(L) \to E$ as $L \to \infty$ leads to

$$\lim_{L \to \infty} \lim_{k \to \infty} H(S_0^{L-1}, S_{-k-L+1}^{-k}) = E.$$

Together we get

$$PMI = \lim_{L \to \infty} \Bigl( 2H(L) - \lim_{k \to \infty} H(S_0^{L-1}, S_{-k-L+1}^{-k}) \Bigr) = 2E - E = E.$$

The second equivalence follows with Corollary B.3, Corollary 4.4 and simple
transformations. □

More generally we can show that the PMI is bounded from above by the excess entropy. 



5.2 Proposition Assume that the PMI exists for a stationary stochastic process, then
it holds that

$$PMI \leq E.$$

Proof.

With Propositions 4.2, B.2 and the rule $H(X \mid Y) \leq H(X)$ for two random variables $X, Y$,
we get



PMI = lim I(S-;S+) = lim (H(S~) - H(S~\S+)) 



= Hm(if(S-)-if(S-|1s T )) 

T— 5>00 

= lim \im(H(S-)-H(S-\SZ T T Zl +1 )) 

T— >oo L— >oo 

< lim \im(H(S-)-H(S-\SzU +1 )) 
= if(S") -^(5-1^) 



/(5-;<S + ) = £. 



□ 



5.3 Remark The PMI ignores some random variables which are taken into account
by the excess entropy. Proposition 5.2 tells us that the PMI forgets this information, whereas the
excess entropy uses the full information available from the realisations of the process. In
this sense the PMI is a coarser complexity measure than the excess entropy. With that
we get a gradation of the considered complexity measures from a coarse to a fine one,
i.e.

$$PMI \leq E \leq C^+_P.$$
6 Explicit Representations

We show a series of explicit representations of the PMI for simple processes. First we
consider a simple case in which the metric entropy vanishes and periodicity is part of the
process, i.e. periodic processes (see footnote 2). For that case the following corollary of Proposition 5.1
gives us the result.

6.1 Corollary Let a stationary periodic process with period $L$ be given. Then the
persistent mutual information amounts to

$$PMI = H(L),$$

in particular it holds that $PMI = E$.

Proof.

With Proposition 5.1 and the fact that $h_P = 0$ holds for periodic processes, the claim
follows with the fact that $E = H(L)$ for periodic processes. We show an additional, more
elementary proof of the corollary which shows the result for an iterated limit like (3.1)
and which shows the existence of the PMI for $L$-periodic processes. Because the process
is $L$-periodic it holds that

$$\Pr(S_1, \dots, S_{L+1}) = \Pr(S_1, \dots, S_L),$$

and $H(M) = H(L)$ for $M \geq L$. Consider $M = L + 1$ and the joint probability distribution,
then one obtains with $(-\tau - t - M + 1) \bmod L = k$

Pr (Sf = af,S— ^M+i^f) 

= P{{u : S?(u) = of A SZlZl M+1 {u) = g}) 

P({u : Sf(w) = o{ A S k (u) = 6 A . . . A S k (oo) = Cm}), if o x = <J L+1 , 
0, else, 

Pt(S± = of), if <7i = <T i+ i,£i = %M = <?k, ■ ■ ■ ,€m-1 = Cfc-l, 

0, else. 

With the definition of H(S^-\ SZ T T ZU M +i) ^ holds for M > L that 

H(S^~ 1 ,SZ-Z t t- M +i) = H ( L )- 
Finally we get for the persistent mutual information 

PMI = lim (2H(M) - lim lim H{S^\ SZ T T Zl M+l )) = H(L). 

□ 

For Markov-processes the PMI vanishes, since the dependencies between the past and 
future blocks disappear in finite time. 



2 A process is called periodic with period $L$ if $S_t = S_{t+L}$ for all $t \in \mathbb{Z}$ and $S_t \neq S_{t+k}$ for $k < L$.
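6.2 Proposition Let a Markov process of order $R$ be given (see footnote 3). Then it holds that

$$PMI = 0.$$

Proof.

With the Markov property and the abbreviation $S_k := S_{-k-L+1}^{-k}$ it follows for $-k < R$ that

$$H(S_0^{L-1} \mid S_k) = -\sum_{\sigma, \xi \in \mathcal{A}^L} \Pr(S_0^{L-1} = \sigma, S_k = \xi) \log \Pr(S_0^{L-1} = \sigma \mid S_k = \xi)$$
$$= -\sum_{\sigma, \xi \in \mathcal{A}^L} \Pr(S_0^{L-1} = \sigma \mid S_k = \xi) \Pr(S_k = \xi) \log \Pr(S_0^{L-1} = \sigma \mid S_k = \xi)$$
$$= -\sum_{\sigma, \xi \in \mathcal{A}^L} \Pr(S_0^{L-1} = \sigma) \Pr(S_k = \xi) \log \Pr(S_0^{L-1} = \sigma)$$
$$= H(L).$$

Hence the persistent mutual information is

$$PMI = \lim_{L \to \infty} \Bigl( H(L) - \lim_{k \to \infty} H(S_0^{L-1} \mid S_{-k-L+1}^{-k}) \Bigr) = \lim_{L \to \infty} \bigl( H(L) - H(L) \bigr) = 0.$$

□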
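The decay of the dependence between distant blocks can be made visible on a small example. The sketch below covers only the single-symbol case $L = 1$ for a first-order two-state chain and computes $I(S_0; S_{-k})$ exactly from powers of the transition matrix; the matrix `T`, its stationary distribution `pi` and the function names are illustrative assumptions, and the computation is a numerical illustration rather than a substitute for the proof.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def lagged_mi(T, pi, k):
    """I(S_0; S_{-k}) in bits for a stationary chain with transition matrix T
    and stationary distribution pi."""
    n = len(T)
    Tk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        Tk = mat_mul(Tk, T)  # after the loop Tk holds the k-step transition matrix
    mi = 0.0
    for i, p_i in enumerate(pi):
        for j, p_j in enumerate(pi):
            joint = p_i * Tk[i][j]  # Pr(S_{-k} = i, S_0 = j)
            if joint > 0:
                mi += joint * math.log2(joint / (p_i * p_j))
    return mi

# Hypothetical two-state chain; pi satisfies pi = pi T.
T = [[0.9, 0.1], [0.2, 0.8]]
pi = [2 / 3, 1 / 3]
values = [lagged_mi(T, pi, k) for k in range(1, 8)]
```

For this chain the values in `values` decrease strictly with the gap $k$ and are numerically indistinguishable from zero for large gaps, which matches the statement $PMI = 0$ in the single-symbol case.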

7 Example Processes 

In the following we investigate concrete examples of stochastic processes and calculate the 
PMI for them. 

7.1 Independent, Identically Distributed Process

A stochastic process is called independent, identically distributed if the finite dimensional
distributions are independent and all distributions are equal, i.e. if for finite times $t_1 < \dots < t_n \in \mathbb{Z}$
it holds that $\Pr(S_{t_1}, \dots, S_{t_n}) = \Pr(S_{t_1}) \cdot \ldots \cdot \Pr(S_{t_n})$ and $\Pr(S_{t_i}) = \Pr(S_{t_j})$ for
all $i, j \in \{1, \dots, n\}$. The probability distributions do not depend on the time distance
because they are identically distributed. Hence it holds that

$$H(S_t^{t+L-1}, S_{-\tau-L+1}^{-\tau}) = H(S_0^{L-1}, S_{-L}^{-1}).$$

Hence the PMI coincides with the excess entropy $E$ by definition. Furthermore we have

$$\Pr(S_0^{L-1}, S_{-L}^{-1}) = \Pr(S_0^{L-1}) \Pr(S_{-L}^{-1}).$$

With the definition of the mutual information we get for every $L \in \mathbb{N}$

$$I(S_0^{L-1}; S_{-L}^{-1}) = \sum_{s_0^{L-1}, s_{-L}^{-1}} \Pr(S_0^{L-1} = s_0^{L-1}, S_{-L}^{-1} = s_{-L}^{-1}) \log \frac{\Pr(S_0^{L-1} = s_0^{L-1}, S_{-L}^{-1} = s_{-L}^{-1})}{\Pr(S_0^{L-1} = s_0^{L-1}) \Pr(S_{-L}^{-1} = s_{-L}^{-1})} = 0$$

and finally

$$PMI = E = \lim_{L \to \infty} I(S_0^{L-1}; S_{-L}^{-1}) = 0.$$



3 A stochastic process is called Markov of order R if for all t > t n > ... > t > it holds that 
Pv(S t \S tn , . ..,S t0 ) = Pr(S t |5 tB , . . . , S tn . R+1 ) for all n > R. 
7.2 Thue-Morse Process

The Thue-Morse sequence was first discovered in 1851 by Prouhet as a
solution of the Prouhet-Tarry-Escott problem. It was rediscovered in 1912 by Thue and in
1921 by Morse in different settings. The sequence appears in many different mathematical
fields and is well studied. This diversity leads to a series of equivalent definitions of the
sequence (there exist at least ten different ways to define it). We choose a definition
which is based on substitutions and which is particularly easy (see Appendix C for a short
introduction to substitution systems). The Thue-Morse sequence is defined over the two-symbol
alphabet $\mathcal{A} = \{0, 1\}$ and is constructed with the following substitution

$$\theta : \mathcal{A} \to \mathcal{A}^2, \quad \text{with } \theta(0) = 01, \ \theta(1) = 10.$$

The Thue-Morse sequence is defined as the fixed point of $\theta$ with

$$u := \theta^\infty(0) = \lim_{t \to \infty} \theta^t(0) = 011010011001\ldots = \theta(u).$$
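The substitution-based construction is easy to reproduce. The following minimal sketch applies the rule $0 \to 01$, $1 \to 10$ repeatedly to the seed `0`; the function name `thue_morse` and the chosen number of iterations are assumptions made for illustration only.

```python
def thue_morse(iterations):
    """Apply the substitution 0 -> 01, 1 -> 10 `iterations` times to the seed '0'."""
    rules = {"0": "01", "1": "10"}
    word = "0"
    for _ in range(iterations):
        word = "".join(rules[c] for c in word)
    return word

# Prefix of length 2**10 of the fixed point u.
u = thue_morse(10)
print(u[:12])  # first twelve symbols of the prefix
```

Each iteration doubles the length and leaves the previous word as a prefix, which is the finite counterpart of the fixed-point property of $u$ under the substitution.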

The stochastic process which generates the Thue-Morse sequence is called Thue-Morse
process. We want to calculate the PMI for that process. The probability measure used
is the counting measure. To be precise, for a block of symbols $F$ in $u$ we define
the counting measure as follows

$$\Pr(F) := \lim_{j \to \infty} \frac{1}{j} \bigl| \{ n < j : u_n \dots u_{n+|F|-1} = F \} \bigr|.$$

We call a block of symbols of length $n$ in $u$ a factor. We can calculate the frequency
and hence the probability of a factor in $u$ with spectral analytical methods [Que87]. One
can show that this substitution system is uniquely ergodic (see footnote 4 and Appendix C for a more
detailed discussion). Hence the Thue-Morse process is a stationary stochastic process
and we can calculate the PMI. For that we follow [Ber94] to calculate the frequencies
of factors in $u$. The key for that calculation is the following lemma which is proved in
[Fog08] and in [Que87].

7.1 Lemma Every factor $F$ of length $n \geq 4$ in $u$ has a unique preimage (up to possibly
appearing border terms) w.r.t. the substitution $\theta$.

Proof.

First we show that the Thue-Morse sequence does not contain blocks of symbols with
more than two identical consecutive symbols. Otherwise, if the block 000 existed in it, there
would need to exist a symbol $a \in \mathcal{A}$ with $\theta(a) = 00$. But with the definition given above that is
not the case and thus all blocks of zeros of length greater than two are also excluded.
Analogously one sees that no blocks of ones with length greater than two appear in
$u$. With the same argument one can show that there are no blocks of the form 01010 or
10101 in $u$, since the preimage of such blocks would be 000 or 111. That means that in
$F$ at least one of the blocks 00 or 11 appears. Split $F$ into blocks of length two such that
none of these smaller blocks is 00 or 11 (possibly there remain some boundary blocks of
length one). By the remarks above this splitting is uniquely determined and gives us
the unique preimage of $F$. □

4 A substitution system is called uniquely ergodic if there exists a unique invariant probability measure.



This fundamental property is also known as the recognizability property of a substitution
system, see [Que87] for more details. In [Dek92] Dekking shows the following proposition.

7.2 Proposition Factors of length $n \geq 2$ in the Thue-Morse sequence with $2^k + 1 \leq n \leq 2^{k+1}$,
$k \in \mathbb{N}_0$, have the following frequencies

$$\frac{1}{3 \cdot 2^k}, \qquad \frac{1}{6 \cdot 2^k}.$$

Factors of length 1 appear with frequency $\frac{1}{2}$.

Proof.

We prove the claim by induction on $k$.

For $k = 0$ and $k = 1$ we have $n = 2$ or $n = 3, 4$ respectively and calculate the
frequencies as in [Que87] via a spectral analysis (see Appendix C for details) and get
$\frac{1}{3}, \frac{1}{6}$ or $\frac{1}{3 \cdot 2}, \frac{1}{6 \cdot 2}$ respectively. Assume now that the claim is proved for some $k$; we want to
show the induction step from $k$ to $k + 1$. It holds that $2^{k+1} + 1 \leq n \leq 2^{k+2}$ and with
Lemma 7.1 a factor $F$ of length $n \geq 4$ has a unique preimage $F'$. Since for $p > n$ the
number of occurrences of $F$ in the first $2p$ letters of the Thue-Morse sequence equals the number of occurrences of
$F'$ in the first $p$ letters (this follows from the construction of the Thue-Morse sequence),
for the frequency of $F$ in $u$ it follows that

$$\Pr(F) = \frac{1}{2} \Pr(F').$$

With the induction assumption the claim follows. □
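The stated frequencies can be compared with empirical factor counts in a long prefix of the sequence. The sketch below is a purely numerical check under the assumption that a prefix of length $2^{14}$ is long enough for the relative counts to be close to their limits; `thue_morse` and `factor_frequencies` are hypothetical helper names.

```python
from collections import Counter

def thue_morse(iterations):
    """Apply the substitution 0 -> 01, 1 -> 10 `iterations` times to the seed '0'."""
    rules = {"0": "01", "1": "10"}
    word = "0"
    for _ in range(iterations):
        word = "".join(rules[c] for c in word)
    return word

def factor_frequencies(word, n):
    """Relative frequencies of all length-n factors occurring in `word`."""
    counts = Counter(word[i:i + n] for i in range(len(word) - n + 1))
    total = sum(counts.values())
    return {factor: c / total for factor, c in counts.items()}

u = thue_morse(14)               # prefix of length 2**14
freq2 = factor_frequencies(u, 2)  # case k = 0, n = 2
freq4 = factor_frequencies(u, 4)  # case k = 1, n = 4
```

For $n = 2$ the empirical values cluster around $1/3$ (factors 01, 10) and $1/6$ (factors 00, 11), and for $n = 4$ around $1/6$ and $1/12$, in agreement with the cases $k = 0$ and $k = 1$ of the proposition.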

If a factor can be continued by adding a letter to the right in at least two different ways
such that the continued symbol block is also a factor in $u$, we call such a factor a right
special factor. In our case this means that for a right special factor $B$ the words $B0$ and
$B1$ are also factors in $u$. Dekking also proved the following lemma [Dek92].

7.3 Lemma Let $F$ be a right special factor of length $n \geq 2$ in $u$ and $2^k + 1 \leq n \leq 2^{k+1}$.
Then $F$ has the frequency $\frac{1}{3 \cdot 2^k}$ and the right extensions of $F$ have the frequency $\frac{1}{6 \cdot 2^k}$.

From that the following important proposition can be derived, which gives us an explicit
expression for the first derivative of the block entropy.

7.4 Proposition ([Ber94]) For all $k \geq 1$ we have the following explicit expressions for
the first derivative of the block entropy $\Delta H(n) := H(n) - H(n - 1)$

$$\Delta H(n) = \frac{4 \log 2}{3 \cdot 2^k} \quad \text{if } 2^k + 1 \leq n \leq 3 \cdot 2^{k-1},$$

$$\Delta H(n) = \frac{2 \log 2}{3 \cdot 2^k} \quad \text{if } 3 \cdot 2^{k-1} + 1 \leq n \leq 2^{k+1}.$$
Proof.

We use the abbreviation $\eta(x) := -x \log(x)$. The first derivative of the entropy can be
written as

$$\Delta H(n) = \sum_{B \in S_n} \bigl( \eta(\Pr(B0)) + \eta(\Pr(B1)) - \eta(\Pr(B)) \bigr),$$

where $S_n$ is the set of all right special factors of length $n$. The cardinality of $S_n$ is given
by the complexity function $p : \mathbb{N} \to \mathbb{N}$; $p(n)$ is defined as the number of factors of length
$n$ in $u$ and so we have $|S_n| = p(n + 1) - p(n)$. In [deL89] the following property of $p(n)$
for $u$ is shown

$$p(n + 1) - p(n) = 4, \quad \text{if } 2^k + 1 \leq n \leq 3 \cdot 2^{k-1},$$
$$p(n + 1) - p(n) = 2, \quad \text{if } 3 \cdot 2^{k-1} + 1 \leq n \leq 2^{k+1}.$$

Using Lemma 7.3 we obtain for $2^k + 1 \leq n \leq 3 \cdot 2^{k-1}$

$$\Delta H(n) = (p(n + 1) - p(n)) \Bigl( 2 \eta\Bigl(\frac{1}{6 \cdot 2^k}\Bigr) - \eta\Bigl(\frac{1}{3 \cdot 2^k}\Bigr) \Bigr) = \frac{4 \log 2}{3 \cdot 2^k}.$$

The claim for the case $3 \cdot 2^{k-1} + 1 \leq n \leq 2^{k+1}$ follows in an analogous way. □
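The first differences of the complexity function used in this proof can be checked by brute force on a finite prefix. The sketch below counts distinct factors of each length and compares $p(n + 1) - p(n)$ with the two cases, reading the bounds as inclusive; that reading, the prefix length $2^{12}$ and the names `complexity` and `expected_diff` are assumptions of this illustration, not statements from [deL89].

```python
def thue_morse(iterations):
    """Apply the substitution 0 -> 01, 1 -> 10 `iterations` times to the seed '0'."""
    rules = {"0": "01", "1": "10"}
    word = "0"
    for _ in range(iterations):
        word = "".join(rules[c] for c in word)
    return word

def complexity(word, n):
    """Number of distinct factors of length n occurring in `word`."""
    return len({word[i:i + n] for i in range(len(word) - n + 1)})

def expected_diff(n):
    """Value of p(n+1) - p(n) for n >= 3 according to the two cases above."""
    k = 1
    while not (2 ** k + 1 <= n <= 2 ** (k + 1)):
        k += 1
    return 4 if n <= 3 * 2 ** (k - 1) else 2

u = thue_morse(12)                                  # prefix of length 2**12
p = {n: complexity(u, n) for n in range(1, 11)}     # p(1), ..., p(10)
diffs = {n: p[n + 1] - p[n] for n in range(3, 10)}  # p(n+1) - p(n) for n = 3..9
```

On this prefix the computed differences alternate between 4 and 2 exactly as the case distinction predicts for $n = 3, \dots, 9$; the check covers small $n$ only and relies on every short factor already occurring in the prefix.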

Thus the metric entropy vanishes, $h_P = \lim_{n \to \infty} \Delta H(n) = 0$. Since the Thue-Morse process is
a one-sided process it is enough to consider $H(L)$ and $H(S_{t+L}^{t+2L-1}, S_0^{L-1})$. With Proposition
7.2 we get

p(L) 



ff(L) = - P < S ° 1 = ^ ^ P < S o 1 = °) > Cf-^ log(L - 1), 



with a finite constant $C$. Furthermore one observes that due to Lemma 7.1 (which is
essentially the recognizability property of the substitution system) a gap between two
symbol blocks is uniquely determined by the border blocks. More precisely, that
means that with given border blocks $S_{t+L}^{t+2L-1} = \sigma$ and $S_0^{L-1} = \xi$ there is exactly one $\eta$
such that $S_0^{t+2L-1} = \xi \eta \sigma$ holds. The same holds for the probabilities, such that with
$2^r + 1 \leq 2L + t - 1 \leq 2^{r+1}$ and Proposition 7.2 we get

H{SlXf-\ St 1 ) = ~ £ JM^f- 1 = v, St 1 = logPr(S£f 1 = a, St 1 = 

< l^g^rtogce-r). 

Finally it follows that 



-^(fiSf-SSf- 1 ) > -|^i 2i 3-^log(6-2^) 'TO, 

and thus 



/(SSf- 1 ; ^o^ 1 ) = (2H(L) - H(Sltl L -\ St 1 )) > 2H(L) - \A\ 2L ^^ ^ 2H(L). 
Since $2H(L) \to \infty$, it follows that PMI = $\infty$ (note that one divergent sequence in the double sequence is enough to derive the divergence of the double sequence). In particular one can also show that E = $\infty$. There seems to exist a whole set of further substitution processes for which the PMI is infinite.
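That the information shared across a gap does not die out in the Thue-Morse word can be illustrated numerically. The following Python sketch is only an empirical illustration and not part of the argument above: it uses single symbols (block length L = 1) instead of blocks, a plug-in estimate on a finite prefix, and gaps that are powers of two; all function names are our own.

```python
from collections import Counter
from math import log2


def thue_morse(length):
    """First `length` symbols of the Thue-Morse word as a list of 0/1."""
    return [bin(i).count("1") % 2 for i in range(length)]


def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def lagged_information(seq, lag):
    """Estimate I(S_0; S_lag) from all pairs (s_i, s_{i+lag}) in `seq`."""
    return mutual_information(list(zip(seq, seq[lag:])))


u = thue_morse(1 << 16)
for lag in (1, 2, 4, 8, 16, 32, 64):
    print(lag, round(lagged_information(u, lag), 4))
```

For these lags the estimate stays at essentially the same positive value instead of decaying, in contrast to the behaviour of a finite-order Markov process.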



7.3 Persistent Mutual Information for a one-dimensional Ising spin chain

We calculate the PMI for a one-dimensional Ising spin chain. Since the spin chain is a Markov process of first order, it immediately follows from Proposition 6.2 that PMI = 0.
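The vanishing of the PMI for a first-order Markov process can be made concrete with a small exact computation. For such a process the information shared by a past block and a future block across a gap is carried by the two boundary symbols, so it suffices to look at I(S_0; S_t). The Python sketch below uses a symmetric two-state chain as a stand-in for the spin chain; the flip probability 0.2 and all names are hypothetical choices of ours.

```python
from math import log2


def mat_mul(a, b):
    """Product of two small matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lagged_information(transition, stationary, lag):
    """Exact I(S_0; S_lag) in bits for a stationary first-order Markov chain."""
    power = transition
    for _ in range(lag - 1):
        power = mat_mul(power, transition)
    info = 0.0
    for i, pi_i in enumerate(stationary):
        for j, pi_j in enumerate(stationary):
            joint = pi_i * power[i][j]          # Pr(S_0 = i, S_lag = j)
            if joint > 0:
                info += joint * log2(joint / (pi_i * pi_j))
    return info


flip = 0.2                                      # hypothetical flip probability
P = [[1 - flip, flip], [flip, 1 - flip]]
pi = [0.5, 0.5]                                 # stationary distribution of P
for lag in (1, 2, 5, 10, 50):
    print(lag, lagged_information(P, pi, lag))
```

The printed values decrease towards zero as the gap grows, which is the behaviour behind PMI = 0.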

7.5 Remark Crutchfield et al. calculated in [Cru97] and in [Fel98] an explicit expression for the excess entropy E for this example; see also [Gme10] for a detailed treatment. It turns out that, as a function of the temperature, the excess entropy attains a maximum at some critical temperature and gets close to zero for very low and very high temperatures, see Figure 1.



Figure 1: Excess entropy E for the one-dimensional Ising model as a function of the temperature T.



It is well known that no phase transition appears in the one-dimensional Ising model (we consider a phase transition as an example of weak emergence). Nevertheless, the fact that E attains a nontrivial value in this case while the PMI is zero suggests that E measures complexity structure at too fine a level and at first sight cannot distinguish between emergent and non-emergent structures. On the other hand, the fact that the PMI is zero supports the intuition that the PMI only detects emergent structures. To confirm this intuition we need more concrete calculation examples; see Section 8 for a detailed discussion.
8 Emergence



Having defined mathematical measures of complexity, we want to investigate their relation to emergent structures appearing in nature. In particular we try to answer the question to what extent the introduced complexity measures, and especially the PMI, are able to detect emergent structures. Before we start we need to state precisely what the terms emergent structure and emergence mean. In many works on this topic this is a crucial point, since the term emergence is often misunderstood and used without a precise definition. The literature uses the term in a vast and confusing variety of situations which seem to have something in common but differ in some respect. Furthermore, many authors argue on an intuitive level and do not define emergence in a precise way. A similar difficulty seems to exist for the term complexity: there are a lot of papers concerning complexity, but often a clear mathematical definition is missing. In this section we therefore try to give a clear description of emergence (at least we want to define the meaning of the term in our sense). To this end we start with a short overview and go back to the roots of emergence, which can be found in philosophy.

8.1 Emergence - an Artificial Expression in Philosophy 

The term emergence is basically an artificial expression in philosophy which has nowadays spread into many different scientific disciplines. The starting point of emergentist thinking goes back to George Henry Lewes (1817-1878) and John Stuart Mill (1806-1873). The golden age of emergentism took place in the 1920s, mainly in Great Britain. During this time many authors developed, independently of each other, different theories of emergence. In particular the work "The Mind and Its Place in Nature" [Bro25] by C. D. Broad, published in 1925, was one of the most discussed works. Even today a lot of researchers take this work as the foundation for a philosophical definition of emergence. At present the term emergence is experiencing a renaissance in the philosophy of mind. We do not want to go into further historical details here; the interested reader will find a good treatment in [Ste99]. Instead we want to present the modern viewpoint of philosophy towards a definition of emergence.

Before we write down a philosophical definition of emergence we must define what we mean by a system. The definitions we state here have to be understood in a philosophical sense (so the formulations are very general), and it is a different question whether one can implement these definitions in a meaningful way in sciences like mathematics or physics. Furthermore, it is in the nature of philosophical definitions that they contain fuzzy terms and concepts. We cannot treat and discuss every detail here and refer to [Ste99, Bec08] for a more extensive discussion.

8.1 Definition A system consists of a set of components $C = \{C_1, C_2, \ldots\}$ and a set of relations R between these components. We denote a system by S := (C, R). The properties of the components and of the relations are called the microstructure of the system.

8.2 Definition A property of a system is a characteristic feature of the system which is reflected in the microstructure of the system. If a system has a property E, but none of its components or subsets of components have the property E, then we call this property a macro-property.

Based on the historical theories of emergence, Stephan defines in [Ste99] different versions of emergence by stating characteristic features of systems which have such emergent properties.

8.3 Definition (Weak emergence, [Ste99]) A property E of a system S = (C, R) is called weak emergent if the system has the following features.

(i) (physical monism) The system S has only physical components and every entity of the world is composed of physical components.

(ii) (systemic property) The property E is systemic, that is, no component or subset of components of the system has the property E. Therefore E is a macro-property.

(iii) (synchronous determinacy or supervenience) The property E depends nomologically on the microstructure of the system. The behaviour and the properties of a system are therefore determined by the behaviour of its components.

The first item in the definition of a weak emergent property is a purely philosophical requirement. In formal and mathematical theories this requirement is beyond debate, since in the natural sciences one always assumes that the world is assembled from physical components. The second required feature of weak emergence is equivalent to that of a macro-property, such as those occurring in statistical mechanics or other theories. The third required feature is a one-sided dependency relation, which says that the macro-property depends on the microstructure of the system. This means the following: the macro-property cannot change if there is no change in the microstructure of the system. There cannot exist another system with the same microstructure but with a different macro-property. We say that the macro-property supervenes over the microstructure. Therefore in the literature the term "weak emergence" is also known as supervenience. By adding further features we can strengthen the notion of weak emergence.

8.4 Definition (synchronous emergence, [Ste99]) A property E of a system S = (C, R) is called synchronous emergent if E is weak emergent and additionally has the following feature.

(iv) (irreducibility) The property E is irreducible. That means that

(a) the property E cannot be analyzed from the behaviour of the (isolated) components of the system. This inability to determine that S has property E is a limitation in principle (so even if we know everything one can know about the single components, it is still impossible to detect E from that knowledge);

(b) or it is in principle impossible to deduce that S has property E from the behaviour of the components of S in different constellations (with different relations).
The formulation that something is in principle impossible means that no scientific progress can change that fact. Therefore synchronous emergence is not an epistemological notion and is not related to the state of scientific knowledge (this crucial fact is often misunderstood in the literature, where emergence is seen as a notion relative to scientific progress).
There remains the question of what exactly it means that a property E can be deduced from the microstructure of a system S. Broad does not say anything about that in his work, but Beckermann gives an interpretation which sheds some light on it. He says the following: a property E can be deduced from the microstructure of a system S if and only if one can deduce from the general laws of nature that every system with that microstructure has all features which characterise E [Bec08].

So far we considered systems without a time component. Adding a time component, we can define an equally strong notion of emergence for time-dependent systems. In such systems the emergent property develops in the course of time.

8.5 Definition (diachrone emergence, [Ste99]) A property E of a system S = (C, R) is called diachrone emergent if E is weak emergent and additionally has the following features.

(v) (novelty) The property E is genuinely new, that is, E has not appeared at an earlier time.

(vi) (structure unpredictability) It is in principle impossible to predict that property E will appear in the course of time of the system.

As before, diachrone emergence is not an epistemological notion. Strictly speaking, the novelty postulate means that a property E of a system S has never been seen before, not even in other systems. So E appears for the first time ever.

Unpredictability means that in principle one cannot predict a property E of a system S from the knowledge of the underlying microstructure of the system S. Stephan argues in [Ste99] that synchronous and diachrone emergence are equivalent forms of emergence (up to the time component).

However, in the context of stationary stochastic processes only synchronous emergence is of interest, since a stochastic process with a diachrone emergent property needs to be non-stationary (see also the remarks in [Set08]).
Beckermann defines synchronous emergence in a more compact way.

8.6 Definition ([Bec08]) A macro-property E of a system S with microstructure (C, R) is synchronous emergent if and only if

(a) the sentence "All systems with microstructure (C, R) have the macro-property E" is a valid law of nature, but

(b) E cannot in principle be deduced from the full knowledge of all features that the components C have in isolation or in different arrangements.
Postulate (a) is basically the same as supervenience (as in the definition of weak emergence), although (a) means a bit more. Beckermann stresses that the statement that the sentence "All systems with microstructure (C, R) have the macro-property E" is a valid law of nature has to be understood as follows: this law of nature is not a special case of an already existing law of nature and cannot be deduced by combining existing laws of nature. Consequently one has to discover this law of nature for the first time and it has to be accepted as a law of nature. This is a very strong requirement, and one can see that this kind of (strong) emergence appears very rarely. To be precise, it is not clear at all whether such a strong version of emergence exists in the real world. However, the definition of Beckermann and the definition of Stephan are equivalent.

For our research and the treatment in this paper we take these philosophical definitions of emergence as a basis. Because of the unclear situation regarding the existence of strong emergence in the real world, we only consider weak emergence and try to formalize this concept mathematically.

8.2 Examples of Emergence 

Before we consider mathematical definitions of emergence we give a few examples of 
emergent properties appearing in the real world. 

8.2.1 Weak Emergence 

There are numerous examples for weak emergence. We consider only three well-known 
examples. 

• Flight structure of migratory birds and swarm behaviour in nature.

Observing swarms of animals (in particular flocks of birds or schools of fish) and their behaviour is a fascinating spectacle. The swarm seems to have its own dynamics, which are not controlled by a central entity. Instead, a kind of self-organisation seems to be responsible for the dynamics. The behaviour of the swarm supervenes over the single individuals. There are simple mathematical models of such behaviour. For example, Cucker and Smale showed analytically for such a model (consisting of differential equations) that under certain preconditions it converges to a stable solution [Cuc07]. In their work the dynamics of the centre of mass of the swarm is the emergent macro-variable, and the single trajectories of the swarm components correspond to the microstructure. The convergence result means that the whole swarm behaviour develops from a chaotic-looking behaviour to a well-structured behaviour. This well-structured behaviour is the emergent property of the dynamical system.

In contrast, Seth defined a measure for weak emergence, the so-called G-emergence (see [Set08]). He calculates this measure for similar swarm models. Varying different parameters in his model, he observes numerically that the G-emergence attains a higher value the more stable the movement structure of the swarm is. On the other hand, if the individuals of a swarm behave completely randomly, the G-emergence attains values near zero.

• Neural networks.

A neural network (or artificial neural network) is a network modelled on the network structure of the neuronal cells in the human brain. It consists of neurons and weighted connections between the neurons. The topology of the network is usually fixed, so that the weights are the only changeable parameters. Every neuron has an assigned threshold and can accept input values from an external user or from other neurons. Each input is multiplied by the corresponding connection weight and the results are summed up.⁵ If this sum is higher than the threshold of the neuron, then the neuron fires an output signal to its successor neurons or to the user. So the whole network works as follows: the user sends an input signal into the network, the signal is passed through the network, and the user gets back an output signal.

Neural networks are often used to classify objects or for forecasting purposes. For that purpose the networks are initially trained with a labeled training set. To minimize misclassifications one can change the weights between the neurons. There are a lot of different learning algorithms, like the backpropagation algorithm, which are suitable for that task. After the network has been trained sufficiently well, it can be used for new classification tasks. The big advantage of a neural network is its flexibility and its ability to learn a specific behaviour. From a mathematical point of view a neural network is a dynamical system, and one can show that under very mild assumptions it can approximate a very broad class of nonlinear functions (for instance every continuous function on a compact set) arbitrarily well. The disadvantage is that it is a priori not clear what kind of topology one has to choose to solve a given classification problem with a neural network. We do not want to go further into this problem and refer to [Sta91] for a more detailed treatment.

Instead we look at emergence in such networks. As a macro-property we specify the classification ability of a trained network. The microstructure consists of the neurons and the connections between them. The macro-property is a systemic property, since no part of the microstructure and no single neuron has the ability to classify objects in the same way as the whole network does. The macro-property also supervenes over the network structure and the corresponding weights, because if one changes some part of the microstructure, the ability to classify objects will also change. Because of that the macro-property is a weak emergent property.

But the macro-property is a reducible property, since with the knowledge of the microstructure one can completely explain (at least in theory) the macro-property. Therefore the macro-property is not synchronous emergent (see also Chapter 17 in [Ste99]). Furthermore, the learning process of the macro-property is also not a case of diachrone emergence. One can calculate the changes of the weights exactly and one can theoretically estimate when the performance of a network is below a given error bound. The learning process in a neural network is nothing else than an optimization of a multidimensional function. So the macro-property is also not an example of a diachrone emergent property.

⁵ In general there are many different possibilities to process the input values in a neuron. For simplicity we only consider one possibility in this paper.

• Phase transitions.

Everybody knows phase transitions from everyday life, for example the change from liquid water to solid ice. In mathematical language a phase is defined as a pure probability measure for a given model. Consider now the well-known Ising model. One can show that the set of asymptotic Gibbs measures is non-empty and convex [Kna06]. A pure Gibbs measure is a Gibbs measure which cannot be written as a convex combination of two other Gibbs measures. In the one-dimensional Ising model there exists exactly one Gibbs measure and there is no phase transition. In the two-dimensional Ising model there are two pure Gibbs measures in the low-temperature region and there is a phase transition at a critical temperature. Below that critical temperature the system remains in one of the two alternative states. The microstructure in the Ising model consists of single spins and the interactions between them, which are described by the energy function. As a macro-property we can choose the mean magnetization.⁶ The mean magnetization is a systemic property by definition, since every spin has a direction but does not reflect the characteristic feature of the mean magnetization, namely the vanishing variance. The mean magnetization supervenes over the spins and the interactions between them. This is due to the fact that if one changes the interactions between the spins, then the mean magnetization will also change. Therefore the mean magnetization is a weak emergent property.
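Returning to the neural network example in the list above, the threshold neuron described there can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the function names, the weights and the thresholds are hypothetical and chosen by hand, not learned.

```python
def neuron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


def xor_network(x1, x2):
    """Two hidden neurons and one output neuron; all parameters are hypothetical."""
    h_or = neuron([x1, x2], [1.0, 1.0], 0.5)      # fires if at least one input is 1
    h_and = neuron([x1, x2], [1.0, 1.0], 1.5)     # fires only if both inputs are 1
    return neuron([h_or, h_and], [1.0, -1.0], 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

The network classifies its two inputs according to the exclusive or, which no single threshold neuron of this kind can do, since the two classes are not linearly separable. In this small sense the classification ability belongs to the network and not to any of its neurons, while it remains fully explainable from the weights and thresholds.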

8.2.2 Strong emergence 

A rigorous proof of the existence of strong emergence in the real world, as it has been defined in the previous section, is still missing. Some experts in the theory of emergence say that the only serious examples of strong emergence discovered so far are mental states and similar phenomena of consciousness. In philosophy, mental states are sensations like pain or intentional states like beliefs, hopes, etc. ([Bec08], p. 17). Such a mental state can be seen as a macro-property of the human brain, which is composed of physical components considered as the microstructure. It is widely accepted among experts that a mental state like pain is determined by the underlying microstructure and thus is a weak emergent property. But there are also many who stress the fact that it is up to now impossible to reduce a mental state to its physical microstructure (which consists just of physical states) and thus to explain it in a physical way. Some experts are convinced that no progress in science can change that situation; others claim the opposite. Another argument for the existence of strong emergence is downward causation. This means that the direction of causality is reversed: the macro-property, which was determined by the microstructure, now acts back on the microstructure and influences it. This kind of feedback loop brings the whole system into a stable state. People who believe in downward causation often give the following example. Suppose an individual has the mental state fear. This mental state is determined by the underlying physical structure, but one can measure an increase of the pulse and changes in many other physical properties (in this scenario the whole physical body is the microstructure). So one can think that the mental state changes the physical structure and thus the microstructure of the system. Critics, however, are of the opinion that a mental state cannot determine anything. The problem is that a clear definition of mental state is missing. In any case there is an extensive discussion of this problem, and for further reading we refer the reader to [Ste99, Bec08, Cha02].

⁶ The mean magnetization is the mean value of the spin values.

There may certainly be a number of examples where one can suppose strong emergence. For example, Chalmers supposes that some phenomena appearing in quantum physics could be considered as examples of strong emergence, but a clear argument is missing in his treatment [Cha02]. In summary, up to now we do not know whether strong emergence can be found in the real world, and there is also no proof showing that it cannot be found.

8.3 Mathematical Models for Emergence 

After defining and discussing the term emergence from a philosophical point of view, we now want to look at it from a mathematical point of view. There are indeed some theories which are able to detect emergent phenomena but do not give a clear definition of how emergence can be understood in mathematical terms. We just mention a small selection of the possible attempts to formalize emergence. In particular we want to consider information theoretic models of emergence.

8.3.1 Bifurcation Theory 

Bifurcation theory deals with the question of whether a solution of a parametrized dynamical system is stable or not, and with the question for which parameters it becomes stable. For an introduction to the theory and a detailed treatment see [Guc83].
What is the connection between bifurcation theory and emergence? In this section we try to give an answer to that question. Consider a parametrized dynamical system which is described by a set of equations (for example a system of differential equations). These equations describe the microstructure of the system in an implicit way (implicit because typically only macro variables appear in the equations). This microstructure can be changed by changing the parameters. One can imagine the single components of the system as solution curves of the dynamical system for different initial values. As a macro-property one can choose multistability of the system (that is, there exist several stable solution branches). This macro-property is a systemic property, since no single solution can have the property of multistability. Furthermore, the multistability depends directly on the microstructure, since it changes when some parameters of the system are changed (remember that the parameters belong to the microstructure). So multistability is a weak emergent property. Since one can (at least in principle) determine from the equations for which parameter values a bifurcation, and thus multistability, occurs, the property of multistability can be deduced from the microstructure and is not strong emergent. In summary, with bifurcation theory one can detect cases of weak emergence, but it is not clear if every weak emergent property can be detected in that way. There are non-equilibrium situations in which it is difficult to detect bifurcations. Maybe there are also much more complicated emergent properties in nature that cannot be modeled by systems of this kind. Furthermore, a direct link to the microstructure of the system is missing, and that is a further reason why we follow an information theoretic approach to defining emergence.
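To make the notion of multistability as a macro-property concrete, one can look at a standard textbook system. The following Python sketch uses the pitchfork normal form dx/dt = r x - x^3, which is our own illustrative choice and is not taken from the references above; the function names are hypothetical as well.

```python
from math import sqrt


def fixed_points(r):
    """Fixed points of dx/dt = r*x - x**3 for the parameter value r."""
    points = [0.0]
    if r > 0:
        points += [-sqrt(r), sqrt(r)]
    return sorted(points)


def is_stable(x, r):
    """Linear stability: the derivative of r*x - x**3 at x must be negative."""
    return r - 3 * x * x < 0


def stable_branches(r):
    """All linearly stable fixed points for the parameter value r."""
    return [x for x in fixed_points(r) if is_stable(x, r)]


for r in (-1.0, 0.25, 1.0):
    print(r, stable_branches(r))
```

For negative r there is a single stable solution, for positive r there are two. The number of stable branches is a property of the whole family of solutions, and it can be read off from the equation, in line with the classification of multistability as weak but not strong emergent.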

8.3.2 Synergetics 

Another model which is related to bifurcation theory is synergetics, which was introduced by Haken in the 1960s [Hak83]. With his theory Haken tried to explain the evolution of new system properties. He often considered structures which appear spontaneously through a process of self-organisation. A classical example is the appearance of laser light from an ordinary light source which is permanently fed with energy from outside: once the supplied energy exceeds a certain amount, laser light appears. From a mathematical point of view one can consider synergetics as a method to approximate solutions of high-dimensional differential equations, see [Jet89, Hak83] for examples. Basically Haken introduces a few artificial macro variables which determine the main behaviour of a system of equations and neglects the remaining variables. So one can imagine that these macro variables determine the behaviour of the microstructure (that would be a case of downward causation). But it is not clear if the macro variables can be deduced from the microstructure, since they were introduced artificially. Due to this fact it is not even clear if such macro variables can be considered as properties of the system. At least from Haken's point of view it remains questionable whether such macro variables can be seen as emergent or not.

8.4 Information Theoretic Definitions of Emergence 

Next we want to have a look at some information theoretic approaches to formalizing emergence. We will briefly discuss two different approaches which are related to excess entropy and statistical complexity. In particular we discuss whether complexity measures like the PMI are suitable for detecting emergence.

8.4.1 Emergence as Reduction of Complexity 

Shalizi and Crutchfield [Sha01, Cru94] suggest a mathematical definition of emergence as follows. First they define a quantity which measures the efficiency of prediction of a stochastic process.

8.7 Definition (Efficiency of prediction, [Sha01]) The efficiency of prediction of a stationary stochastic process is the ratio between its excess entropy and its statistical complexity,

\[ e^+ := \frac{E}{C_\mu^+}, \qquad e^- := \frac{E}{C_\mu^-}, \]

where $e^+ := 0$ if $C_\mu^+ = 0$ and $e^- := 0$ if $C_\mu^- = 0$.
From the properties of the excess entropy it follows that

\[ 0 \le e^+ \le 1, \qquad 0 \le e^- \le 1. \]

The efficiency of prediction tells us how much of the internal process information can actually be used for predicting future process behaviour.
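As a small illustration, the ratio of Definition 8.7 together with its convention for vanishing statistical complexity can be written as a Python function. We read the definition as E divided by the statistical complexity, with the value 0 when the latter is 0. The function name and the numerical values below are assumptions of ours: the first two rows are standard textbook cases, the third row is made up.

```python
def efficiency_of_prediction(excess_entropy, statistical_complexity):
    """Ratio E / C_mu, with the convention that the ratio is 0 when C_mu = 0."""
    if statistical_complexity == 0:
        return 0.0
    return excess_entropy / statistical_complexity


# Illustrative pairs (E, C_mu) in bits; these are assumptions, not results of this paper.
examples = {
    "i.i.d. fair coin": (0.0, 0.0),        # no memory at all: E = 0 and C_mu = 0
    "period-2 process": (1.0, 1.0),        # the stored phase is fully usable
    "hypothetical process": (0.3, 1.2),    # made-up values with E <= C_mu
}
for name, (E, C) in examples.items():
    print(name, efficiency_of_prediction(E, C))
```

Values close to 1 mean that almost all stored information is useful for prediction, values close to 0 mean that the process stores much more than an observer can exploit.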

8.8 Definition (Derived process, [Sha01]) A stationary stochastic process $(S'_t, t \in \mathbb{Z})$ is called derived from another stationary stochastic process $(S_t, t \in \mathbb{Z})$ if and only if there is a measurable function $f : S \to G$ into a measurable space $(G, \mathcal{G})$ such that $S'_t := f(S_t)$. $(S'_t, t \in \mathbb{Z})$ is called the derived or filtered process and the function f is called the filter.
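On the level of a single realisation, Definition 8.8 simply says that the filter is applied symbol by symbol. The following Python sketch shows this for two filters; the sample realisation, the alphabet {0, 1, 2, 3} and the names are hypothetical.

```python
def derive(realisation, f):
    """Apply the filter f symbol by symbol: S'_t = f(S_t)."""
    return [f(s) for s in realisation]


realisation = [0, 3, 1, 2, 2, 0, 3, 1]             # hypothetical symbols from {0, 1, 2, 3}
parity = derive(realisation, lambda s: s % 2)      # coarse-graining filter
constant = derive(realisation, lambda s: 0)        # filter that forgets everything
print(parity)
print(constant)
```

The constant filter is an extreme case: the derived process does not change at all when the underlying process is changed.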

Based on that Shalizi defines emergence as follows. 

8.9 Definition (Emergent process, [Sha01]) A derived stochastic process is emergent if it has a greater efficiency of prediction $e^+$ than the process it derives from. We then say that the derived process emerges from the underlying process.

8.10 Definition (Intrinsic emergence, [Sha01]) A process is intrinsic emergent if there is another process which emerges from it.

Shalizi justifies his definition on the following basis. On the one hand, Shalizi's idea of emergence is that an emergent property supervenes over the components of a system (this idea coincides with the definition of weak emergence seen before). On the other hand, he assumes that the appearance of emergence implies a simplified description of the system. He describes this idea by a reduced complexity, as formalized in Definition 8.9. Shalizi says that Definition 8.8 represents the assumption of supervenience. It remains questionable whether a reasonable definition of emergence is possible on the basis of such a vague argument. The problem is that Shalizi does not clearly define what he means by system and emergent properties in a philosophical sense. Because of this lack of a philosophical basis we use the definitions of emergence from Section 8.1.

Within this setting we have to consider a derived process $\mathcal{S}'$ as a macro-property of the underlying process $\mathcal{S}$, and the random variables $S_t$ together with their correlations form the microstructure. By definition the macro-property is a systemic property. Furthermore $\mathcal{S}'$ supervenes over the microstructure, since if one changes the underlying process $\mathcal{S}$, in general the process $\mathcal{S}'$ will also change (but there exist filters for which this is not the case). In such situations the derived process can be classified as weak emergent. If the derived process additionally has a higher efficiency of prediction than the underlying process, then there is more information stored in the realisations of the derived process than in those of the underlying process. In the extreme case one has $C_\mu^- = E$, and by Corollary B.3 that is the case if and only if $S^+ = g(S^-)$. That means that the future causal states can be completely deduced from the past causal states.
In the derived process we can deduce future causal states from past causal states "better" than in the underlying process, but it is not clear how this is related to strong emergence.
In [Sha01] Shalizi also states a concrete filter function to construct a derived process. Unfortunately further examples and results are missing, and more evidence (in the form of examples) is necessary to check whether this definition of emergence is reasonable or not. Nevertheless it is an interesting approach.



8.4.2 Model of Emergent Description 

In [Pol04, Pol06] Polani suggests an emergent description of dynamical systems. Inspired by Haken's theory of synergetics, he states an information theoretic decomposition of a dynamical system into information-preserving and independent subsystems. We consider the set $\mathcal{S}$ of all realisations of a stationary stochastic process. Polani decomposes this set into finitely many components. Such a decomposition is given in the form of k random variables $S^{(i)} : \mathcal{S} \to \mathcal{S}^{(i)}$, $i = 1, \ldots, k$, with $\bigcup_{i=1}^{k} \mathcal{S}^{(i)} = \mathcal{S}$. The random variables are not further specified and it remains unclear under which conditions such a decomposition exists. If one assumes that it exists, then one can imagine it as depicted in the following diagram.




With that Polani defines the emergent description of a process. 

8.11 Definition (Emergent description, [Pol06]) Let a stationary stochastic process $\mathcal{S}$ with a decomposition into k random variables $S^{(1)}, \ldots, S^{(k)}$ be given. Then the k random variables are called an emergent description of $\mathcal{S}$ if

(a) the decomposition is a complete representation of the system:

\[ I(S_t; S_t^{(1)}, \ldots, S_t^{(k)}) = H(S_t), \]

(b) the individual components of the decomposition are independent of each other:

\[ I(S_t^{(i)}; S_t^{(j)}) = 0 \quad \text{for } i \ne j, \]

(c) and the components are information-preserving in time:

\[ I(S_t^{(i)}; S_{t+1}^{(i)}) = H(S_{t+1}^{(i)}). \]
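To see how the three conditions of Definition 8.11 can be checked, consider a deliberately trivial toy system which is our own construction and not taken from Polani's work: the state consists of two bits, the first bit flips in every time step and the second bit is frozen, and the state is uniformly distributed. In the Python sketch below we read condition (b) as vanishing mutual information between different components; all names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, f):
    """Distribution of f(outcome) under the distribution `joint`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[f(outcome)] += p
    return dict(out)


def mutual_information(joint, f, g):
    """I(f; g) = H(f) + H(g) - H(f, g) for functions f, g of the outcome."""
    return (entropy(marginal(joint, f)) + entropy(marginal(joint, g))
            - entropy(marginal(joint, lambda o: (f(o), g(o)))))


def step(state):
    a, b = state
    return (1 - a, b)            # first bit flips, second bit is frozen


# An outcome is a pair (S_t, S_{t+1}); S_t is uniform on {0,1}^2, which is stationary.
joint = {(s, step(s)): 0.25 for s in product((0, 1), repeat=2)}

now = lambda o: o[0]
comp_now = [lambda o: o[0][0], lambda o: o[0][1]]
comp_next = [lambda o: o[1][0], lambda o: o[1][1]]

complete = mutual_information(joint, now, lambda o: (o[0][0], o[0][1]))   # condition (a)
independent = mutual_information(joint, comp_now[0], comp_now[1])         # condition (b)
preserved = [mutual_information(joint, comp_now[i], comp_next[i]) for i in range(2)]  # (c)
print(complete, independent, preserved)
```

In this toy system the two bits satisfy all three conditions. Nothing is claimed here about the existence of such decompositions for less trivial processes.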



The significant difference to the previous approaches is that the whole realisation space, and thus the whole process, is decomposed. Figure 2 shows a schematic sketch of that situation.

Figure 2: Schematic sketch of an emergent description as a decomposition into independent components.



An emergent description is a decomposition into information-independent components which 
preserve their information for all times. Unfortunately, for this model too, no results are 
known that guarantee the existence of such a decomposition. Even Polani does not 
give an explicit example in his work. Because of this lack of knowledge it is difficult to 
judge the proposal in a reasonable way. With the facts known so far it does not seem 
possible to check how this description relates to the definition of emergence given in 
Section 8.1. 

8.4.3 Complexity Measures as Definitions of Emergence 

We now want to investigate how well the complexity measures excess entropy, statistical 
complexity and persistent mutual information are suited to define and detect emergence. We 
briefly recall how one can understand and interpret the different complexity measures. 

• Excess entropy $E$: The amount of past information which is currently available and 
communicated into the future. In particular it represents the amount of information 
one can extract from a concrete (past) realisation to make predictions for future 
realisations. 

• Statistical complexity $C_P^-$: The amount of information stored in the future causal states. 
In other words it is the amount of information about the past which is stored in 
the process to predict the future in an optimal way. In general a concrete realisation 
contains less information than the process has internally stored. 

• Persistent mutual information PMI: The amount of past information which is 
currently available and which is communicated into a very far future. In other 
words: the amount of information one gets from a concrete past realisation and 
which will be preserved for all future times and for all future realisations. 

As we proved before we have 

$$\mathrm{PMI} \le E \le C_P^-.$$

So we have a gradation from a fine complexity measure, $C_P^-$, to a coarse one, the PMI. 
If we consider regular structures, like $L$-periodic processes, then we find that all 
three complexity measures coincide. 

Such processes can be generated by dynamical systems with periodic behaviour. One 
example is the logistic map, which produces periodic orbits for certain parameter values. It is 
defined with a parameter $r$ as follows: 

$$T_r : [0,1] \to [0,1], \quad T_r(x) := r x (1 - x), \quad r \in [0,4],$$
$$x_{n+1} := T_r(x_n), \quad x_0 \in [0,1].$$

Furthermore we define a random variable $S : [0,1] \to \mathcal{A}$ by 

$$S(x_n) := \begin{cases} 0, & x_n \in [0, 0.5], \\ 1, & x_n \in (0.5, 1], \end{cases}$$

and the alphabet $\mathcal{A} := \{0, 1\}$. The discrete time series which is produced by $x_{n+1} = T_r(x_n)$ 
for an arbitrary initial value $x_0 \in [0,1]$ is called a trajectory. Together with a $\sigma$-algebra 
and a $T_r$-invariant probability measure$^{7}$ we get a stationary stochastic process. A subset 
$A$ of the phase space $[0,1]$ is called invariant under $T_r$ if $T_r(A) \subseteq A$. 

8.12 Definition A closed invariant set $A$ is called an attracting set if there is a 
neighbourhood $U$ of $A$ such that for the flow $(T_t)_{t \in \mathbb{N}}$ of $T$ it holds that 

$$\lim_{t \to \infty} d(T_t(x), A) = 0 \quad \forall x \in U.$$

Here $d$ denotes the corresponding metric in the phase space. 

8.13 Definition A (compact) invariant set $A$ of the phase space is called an attractor if 
$A$ is an attracting set which contains a dense trajectory. 

For suitable initial values all trajectories of a dynamical system tend to an attractor of 
the dynamical system. 

We can plot an attractor to get an overview of the long-term behaviour 
of the dynamics.$^{8}$ Figure 3 shows the attractor of the logistic map for parameters in the 
interval $[2.5, 4]$. 

In the lower parameter region one observes periodic behaviour of the logistic map. For 
example, if we pick $r = 3.2$ then the period is 2. The random variable $S$ codes such a 
period in the generated symbol sequence (provided the partition at 0.5 separates the points 
of the periodic orbit), so that we get a periodic stochastic process. 
For such processes the values of $C_P^-$, $E$ and PMI are equal. If we look at the attractor we 
recognize parameter values at which the period doubles. At these 
points a bifurcation occurs and new solution branches appear. The periodic behaviour 
corresponds to the multistability we considered in Section 8.3.1. We have already seen that 
multistability is a weakly emergent property of the system. All three complexity measures 
detect this weakly emergent property through the non-trivial values they take. Up to this 
point every complexity measure is equally powerful in detecting weakly emergent properties. 
But when we ask whether every structure for which these complexity measures 
assume positive values is weakly emergent, we will see the differences. Table 1 shows a summary of 
different examples together with some calculated entropic expressions and complexity measures. 
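The symbolic coding of the logistic map described above can be sketched as follows. This is a minimal illustration, not the procedure used for the paper's figures: the function names, the initial value 0.3 and the transient length are assumptions made here. The sketch uses $r = 3.4$ in addition to $r = 3.2$ because for $r = 3.2$ both points of the 2-cycle (approximately 0.513 and 0.799) lie above 0.5, so the binary coding $S$ is constant there, whereas for $r = 3.4$ the two points (approximately 0.452 and 0.842) are separated by the partition.

```python
def logistic_orbit(r, x0, n_steps, n_transient=1000):
    """Iterate x_{n+1} = r * x_n * (1 - x_n) and return n_steps points
    recorded after discarding n_transient initial iterations."""
    x = x0
    for _ in range(n_transient):
        x = r * x * (1.0 - x)
    orbit = []
    for _ in range(n_steps):
        orbit.append(x)
        x = r * x * (1.0 - x)
    return orbit


def symbolize(orbit):
    """Binary coding S: 0 if x lies in [0, 0.5], 1 if x lies in (0.5, 1]."""
    return [0 if x <= 0.5 else 1 for x in orbit]


# r = 3.4: the 2-cycle straddles 0.5, so the symbols alternate.
orbit_34 = logistic_orbit(3.4, 0.3, 8)
symbols_34 = symbolize(orbit_34)

# r = 3.2: the orbit has period 2, but both cycle points exceed 0.5.
orbit_32 = logistic_orbit(3.2, 0.3, 8)
symbols_32 = symbolize(orbit_32)

print(symbols_34)
print(symbols_32)
```

Whether a periodic orbit shows up as a periodic symbol sequence therefore depends on the chosen partition, not only on the parameter $r$.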



7 The measure depends on $r$ and its existence is a priori not clear. 

8 For that we fix $r$, choose an initial value $x_0$ and calculate the corresponding trajectory. One then 
plots the points of the trajectory against the parameter, starting after a few hundred iterations to 
avoid numerical artifacts. 

Figure 3: Attractor of the logistic map for parameter values $r \in [2.5, 4]$. 

Table 1: Tabular overview of structure quantities corresponding to example processes. 

Consider the one-dimensional Ising model. In the low-temperature region the excess 
entropy assumes positive values (compare Figure 1). The statistical complexity also takes 
positive values. We already know that there is no phase transition in the one-dimensional 
Ising model. So there is no weakly emergent property to detect (at least if we choose the 
philosophical definition given in the section before). This indicates that the excess entropy 
and the statistical complexity are not suitable for detecting weak emergence, since they 
detect something (assume positive values) although there is nothing interesting to detect 
(in the sense of emergence). It seems that $E$ and $C_P^-$ can detect interesting structures 
of dynamical systems but they are too fine to filter weakly emergent structures. Only 
the persistent mutual information is zero in these examples (except for the Thue-Morse 
example). So the only remaining candidate for detecting emergent properties is the PMI. 
But the examples calculated so far are not enough to decide whether the PMI really detects emergent 
properties. In particular an example is still missing where the PMI assumes a non-trivial 
value and clearly differs from $E$ and $C_P^-$. But there is some numerical evidence. Ball, 
Diakonova and MacKay numerically calculated the PMI for the logistic map in [Bal10]. 
From the numerical results one recognizes that in the parameter region $r \in [3.58, 3.68]$ 
so-called "chaotic bands" appear which are detected by the PMI. Roughly speaking, chaotic 
bands are disjoint regions in the phase space $[0,1]$ between which a trajectory switches 
periodically but inside which it behaves chaotically. Whether this is a further example 
of a weakly emergent property is an open question. In particular a rigorous analytical 
investigation and calculation of the PMI for these parameter values is missing. 
In summary we can say that the PMI is the most promising complexity measure among 
the ones investigated here for detecting weakly emergent properties of systems. But 
examples and rigorous results are still missing to further confirm this conjecture. 
A Information-theoretic Facts 

A.1 Proposition For the mutual information of two random variables $X, Y$ it holds that 

(i) $I(X;Y) \ge 0$, with equality iff $X$ and $Y$ are stochastically independent. 

(ii) $I(X;Y) = I(Y;X)$. 

(iii) Let $X = (X_1, X_2, \dots)$ be a stochastic process, then it holds that 

$$I(X;Y) = \lim_{n \to \infty} I((X_1, \dots, X_n); Y).$$

Proof. 

See [Pin64] Chapter 2.2. □ 
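Properties (i) and (ii) can be checked numerically on small examples. The following is a minimal sketch, assuming finite alphabets, logarithms to base 2 and a joint distribution given as a dictionary; the function name and the three example distributions are illustrative choices made here.

```python
from math import log2


def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Terms with p = 0 are skipped, which implements the convention 0 log 0 = 0.
    return sum(
        p * log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Two independent fair bits: I(X;Y) = 0.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Y is a copy of a fair bit X: I(X;Y) = H(X) = 1 bit.
copy = {(0, 0): 0.5, (1, 1): 0.5}
# A dependent but not deterministic pair.
skewed = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
skewed_swapped = {(y, x): p for (x, y), p in skewed.items()}

print(mutual_information(independent), mutual_information(copy))
print(mutual_information(skewed), mutual_information(skewed_swapped))
```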

B Excess Entropy 

The metric entropy is defined by $h_P := \lim_{L \to \infty} \frac{H(L)}{L}$ and gives us a geometric interpretation 
of the excess entropy. 

B.1 Proposition ([Gra86]) It holds that 

$$E = \lim_{L \to \infty} (H(L) - L h_P).$$

Proof. 

We write $E$ as the limit of partial sums and use discrete integration, where $\Delta H(L) = H(L) - H(L-1)$: 

$$\sum_{L=1}^{M} (\Delta H(L) - h_P) = H(M) - H(0) - M h_P.$$

Because of $H(0) = 0$ it follows that 

$$E = \lim_{M \to \infty} (H(M) - M h_P).$$

□ 
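The formula $E = \lim_{L \to \infty} (H(L) - L h_P)$ can be illustrated on a periodic sequence. The sketch below is an assumption-laden estimate, not the paper's method: block entropies are taken from empirical block frequencies of a finite sample, logarithms are to base 2, and $h_P$ is approximated by the last entropy increment $\Delta H(L_{\max})$. For a process of period 2 one expects $H(L) = 1$ bit for $L \ge 1$, $h_P = 0$ and hence $E = 1$ bit.

```python
from collections import Counter
from math import log2


def block_entropy(seq, L):
    """Empirical Shannon entropy (bits) of the length-L blocks of seq; H(0) = 0."""
    if L == 0:
        return 0.0
    counts = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def excess_entropy_estimate(seq, L_max):
    """Estimate E as H(L_max) - L_max * h, with h approximated by
    the increment H(L_max) - H(L_max - 1)."""
    H = [block_entropy(seq, L) for L in range(L_max + 1)]
    h = H[L_max] - H[L_max - 1]
    return H[L_max] - L_max * h


periodic = [0, 1] * 500          # sample of a period-2 process
E_estimate = excess_entropy_estimate(periodic, 6)
print(E_estimate)
```

Because the sample is finite, the estimate matches 1 bit only up to a small boundary effect.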



B.2 Proposition ([Ell09]) For a stationary stochastic process $\mathcal{P}$ it holds that 

$$E = I(S^+; S^-).$$

Proof. 

To prove the proposition we use a mutual information of four random variables introduced 
in [Yeu91] and follow the same strategy as in [Cru10]. For random variables $X, Y, Z, U$ 
we define 

$$I(X;Y;Z;U) := I(X;Y;Z) - I(X;Y;Z|U),$$
$$I(X;Y;Z) := I(X;Y) - I(X;Y|Z), \quad \text{with } I(X;Y|Z) := H(X|Z) - H(X|Y,Z),$$
$$I(X;Y;Z|U) := I(X;Y|U) - I(X;Y|Z;U), \quad \text{with } I(X;Y|Z;U) := H(X|Z,U) - H(X|Z,U,Y).$$
Furthermore we use the following two identities which hold for a measurable function $f$ 
of a random variable $X$ (see Lemma 2.5.2 in [Gra90]): 

$$H(f(X)|X) = 0, \quad H(X, f(X)) = H(X). \tag{B.1}$$

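We define mappings $g : \mathcal{A}^{\mathbb{N}} \to \{1, \dots, m\}$ with $g(\overrightarrow{\sigma}) := j$ if $\overrightarrow{\sigma} \in \mathcal{S}_j^-$ and $f : \mathcal{A}^{-\mathbb{N}_0} \to 
\{1, \dots, n\} \times \mathcal{A}$ with $f(\overleftarrow{\sigma}\sigma_0) := (i, \sigma_0)$ if $\overleftarrow{\sigma} \in \mathcal{S}_i^+$. Since we are considering $\epsilon$-machines the 
mappings $g$ and $f$ are well-defined and measurable. Thus we can write $S^+ = f(\overleftarrow{S})$, $S^- = 
g(\overrightarrow{S})$ and using (B.1) we get 

$$H(S^+|\overleftarrow{S}) = 0, \quad H(S^-|\overrightarrow{S}) = 0, \tag{B.2}$$

$$H(\overleftarrow{S}, S^+) = H(\overleftarrow{S}), \quad H(\overrightarrow{S}, S^-) = H(\overrightarrow{S}), \tag{B.3}$$

$$H(\overrightarrow{S}|\overleftarrow{S}, S^+) = H(\overrightarrow{S}|S^+), \quad H(\overleftarrow{S}|\overrightarrow{S}, S^-) = H(\overleftarrow{S}|S^-). \tag{B.4}$$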

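In the next step we show $I(\overrightarrow{S}; \overleftarrow{S}; S^+; S^-) = I(\overrightarrow{S}; \overleftarrow{S}) = E$. Consider 

$$I(\overrightarrow{S}; \overleftarrow{S}; S^-|S^+) = I(\overrightarrow{S}; \overleftarrow{S}|S^+) - I(\overrightarrow{S}; \overleftarrow{S}|S^+; S^-), \tag{B.5}$$

and remark that the first term vanishes because with (B.4) it holds that 

$$I(\overrightarrow{S}; \overleftarrow{S}|S^+) = H(\overrightarrow{S}|S^+) - H(\overrightarrow{S}|\overleftarrow{S}, S^+) \overset{(B.4)}{=} 0.$$

The second expression of (B.5) is also zero, since 

$$I(\overrightarrow{S}; \overleftarrow{S}|S^+; S^-) = H(\overrightarrow{S}|S^+, S^-) - H(\overrightarrow{S}|S^+, S^-, \overleftarrow{S}) = 0.$$

Putting all together we obtain 

$$I(\overrightarrow{S}; \overleftarrow{S}; S^-|S^+) = 0.$$

Furthermore we have 

$$I(\overrightarrow{S}; \overleftarrow{S}; S^-) = I(\overrightarrow{S}; \overleftarrow{S}) - I(\overrightarrow{S}; \overleftarrow{S}|S^-) = I(\overrightarrow{S}; \overleftarrow{S}),$$

since $I(\overleftarrow{S}; \overrightarrow{S}|S^-) = H(\overleftarrow{S}|S^-) - H(\overleftarrow{S}|\overrightarrow{S}, S^-) \overset{(B.4)}{=} 0$. Putting things together we get 

$$I(\overrightarrow{S}; \overleftarrow{S}; S^+; S^-) = I(\overrightarrow{S}; \overleftarrow{S}).$$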

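In a second step we show $I(\overrightarrow{S}; \overleftarrow{S}; S^+; S^-) = I(S^+; S^-)$. As in the first step the following 
term vanishes: 

$$I(S^+; S^-; \overrightarrow{S}|\overleftarrow{S}) = I(S^+; S^-|\overleftarrow{S}) - I(S^+; S^-|\overrightarrow{S}; \overleftarrow{S}) = 0, \tag{B.6}$$

since $I(S^+; S^-|\overleftarrow{S}) = H(S^+|\overleftarrow{S}) - H(S^+|S^-, \overleftarrow{S}) \overset{(B.2)}{=} 0$ and 

$$I(S^+; S^-|\overrightarrow{S}; \overleftarrow{S}) = H(S^+|\overrightarrow{S}, \overleftarrow{S}) - H(S^+|S^-, \overrightarrow{S}, \overleftarrow{S}) = 0.$$

We consider now 

$$I(S^+; S^-; \overrightarrow{S}) = I(S^+; S^-) - I(S^+; S^-|\overrightarrow{S}),$$

and the second term disappears, since 

$$I(S^+; S^-|\overrightarrow{S}) = H(S^-|\overrightarrow{S}) - H(S^-|S^+, \overrightarrow{S}) \overset{(B.2)}{=} 0.$$

This results in 

$$I(\overrightarrow{S}; \overleftarrow{S}; S^+; S^-) = I(S^+; S^-),$$

and finally we get 

$$E = I(\overrightarrow{S}; \overleftarrow{S}) = I(S^+; S^-).$$

□ 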

B.3 Corollary ([Ell09]) It holds that 

$$C_P^+ = E + H(S^+|S^-),$$

$$C_P^- = E + H(S^-|S^+).$$

Furthermore the following inequalities hold: 

$$C_P^+ \ge H(S^+|S^-), \quad C_P^- \ge H(S^-|S^+).$$

Proof. 

The first two claims follow with $E = I(S^+; S^-) = H(S^+) - H(S^+|S^-)$, $C_P^+ = H(S^+)$ and 
the symmetry of mutual information. Due to $E \ge 0$ the other two inequalities follow. □ 

C Spectral Analysis of Substitution Systems 

This section is an excerpt of Chapter 5 in [Que87]. We only state the proofs which are 
relevant for us and refer for the remaining parts to [Que87]. In the following we consider a 
special type of dynamical systems, which often leads to interesting sequences of symbols. 
As usual we denote by $\mathcal{A}$ a finite alphabet, e.g. $\mathcal{A} := \{0, \dots, s-1\}$. Furthermore we 
define $\mathcal{A}^* := \bigcup_{k \ge 1} \mathcal{A}^k$ as the set of all finite words over $\mathcal{A}$. 

C.1 Definition A mapping $\zeta : \mathcal{A} \to \mathcal{A}^*$ is called a substitution on $\mathcal{A}$. To every letter 
$i \in \mathcal{A}$ we assign a word $\zeta(i)$ such that for at least one letter $i$ it holds that $l_i := |\zeta(i)| \ge 2$. 
If $l_i = q \ge 2$ holds for all $i \in \mathcal{A}$, then $\zeta$ is a substitution of constant length $q$. 

Every substitution $\zeta$ induces a mapping $\zeta : \mathcal{A}^* \to \mathcal{A}^*$ with 

$$\zeta(B) := \zeta(b_0) \dots \zeta(b_n), \quad B = b_0 \dots b_n \in \mathcal{A}^*.$$

Similarly we define a mapping $\zeta : \mathcal{A}^{\mathbb{N}} \to \mathcal{A}^{\mathbb{N}}$. We equip $\mathcal{A}^{\mathbb{N}}$ with the product of the discrete 
topologies on $\mathcal{A}$, such that the mapping $\zeta$ is continuous with respect to this topology. Remark that in general $\zeta$ 
is not surjective. $\zeta^k$ denotes the $k$-th iterate of $\zeta$. Fixed points of $\zeta^k$ for some $k \ge 1$ are of 
special interest for us, and the next proposition gives sufficient conditions for the existence 
of a fixed point. 
C.2 Proposition Let $\zeta$ be a substitution with $|\zeta^n(\alpha)| \to \infty$ for every $\alpha \in \mathcal{A}$. Then 
there exists a fixed point $u \in \mathcal{A}^{\mathbb{N}}$ and an integer $k \ge 1$ such that $u = \zeta^k(u)$. 

Henceforth we assume that $\zeta$ fulfils the following two conditions: 

$$\lim_{n \to \infty} |\zeta^n(\beta)| = \infty \quad \text{for all } \beta \in \mathcal{A}, \tag{C.1}$$

there exists $\alpha \in \mathcal{A}$ (denoted as 0 in the following) such that the word $\zeta(\alpha)$ starts with $\alpha$. (C.2) 

In particular these conditions guarantee the existence of a fixed point, which we denote by $u$ 
in the following. From now on the alphabet $\mathcal{A}$ consists only of those letters which actually 
appear in the words $\zeta^n(0)$, $n \ge 0$. As an example we consider the substitution which 
generates the Thue-Morse sequence. With $\mathcal{A} = \{0, 1\}$ and $\zeta(0) = 01$, $\zeta(1) = 10$ the 
Thue-Morse sequence is the fixed point $u = \zeta^\infty(0)$. 
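The construction of the fixed point can be sketched in a few lines. This is a minimal illustration under the assumption that words are represented as Python strings and the substitution as a dictionary; the names `apply_substitution` and `iterate` are chosen here for illustration. Because $\zeta(0)$ starts with 0, each iterate $\zeta^n(0)$ is a prefix of the next one, so the iterates converge to $u$.

```python
THUE_MORSE = {"0": "01", "1": "10"}


def apply_substitution(word, rule):
    """Apply the substitution letter by letter: zeta(b_0 ... b_n) = zeta(b_0) ... zeta(b_n)."""
    return "".join(rule[letter] for letter in word)


def iterate(word, rule, n):
    """Return zeta^n(word)."""
    for _ in range(n):
        word = apply_substitution(word, rule)
    return word


prefix_16 = iterate("0", THUE_MORSE, 4)   # first 16 letters of u
prefix_32 = iterate("0", THUE_MORSE, 5)   # first 32 letters of u
print(prefix_16)
```

A well-known alternative description, useful as a cross-check, is that the $n$-th letter of the Thue-Morse sequence is the parity of the number of ones in the binary expansion of $n$.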

With $\zeta$ and $u$ we associate a topological dynamical system $(X, T)$, with $T : \mathcal{A}^{\mathbb{N}} \to \mathcal{A}^{\mathbb{N}}$ 
the one-sided shift mapping on $\mathcal{A}^{\mathbb{N}}$ and $X := \overline{O(u)}$, where $O(u) := \{T^n(u) : n \ge 0\}$. 
We want to introduce the concept of ergodicity for the associated system and define for 
that purpose the notion of minimality. 

C.3 Definition A topological dynamical system $(X, T)$ is called minimal if the only closed 
$T$-invariant sets in $X$ are $X$ and $\emptyset$. 

Minimality is characterized as follows. 

C.4 Proposition The system $(X, T)$ is minimal $\iff$ $O(x)$ is dense in $X$ for every 
$x \in X$. 

In particular for the associated system the following result holds. 

C.5 Proposition The system $(X, T)$ is minimal if and only if for every $\alpha \in \mathcal{A}$ there exists 
an integer $k > 0$ such that $\zeta^k(\alpha)$ contains 0. 

C.6 Definition A substitution $\zeta$ is called irreducible on $\mathcal{A}$ if for every pair of letters 
$\alpha, \beta \in \mathcal{A}$ an integer $k = k(\alpha, \beta)$ exists such that $\beta \in \zeta^k(\alpha)$. $\zeta$ is called primitive if there 
exists an integer $k$ independent of $\alpha, \beta$ such that $\beta \in \zeta^k(\alpha)$ for all $\alpha, \beta \in \mathcal{A}$. 

The condition in Proposition C.5, which guarantees minimality, implies that $\zeta$ is 
primitive. If $\zeta$ is primitive then $X$ does not depend on the fixed point $u$ but only 
on $\zeta$, since every letter appears in $u$. Because of this the system $(X, T)$ is uniquely 
determined by $\zeta$ and we sometimes denote it by $(X(\zeta), T)$. 

For two words $B, C \in \mathcal{A}^*$ we denote by $L_C(B)$ the number of occurrences of the word $C$ 
in $B$. In particular for a letter $i \in \mathcal{A}$ we write $L_i(B)$ for the number of times the letter $i$ 
appears in $B$. 

C.7 Definition The $s \times s$ matrix $M = M(\zeta)$ with $m_{ij} = L_i(\zeta(j))$ for $i, j \in \mathcal{A}$ is called the 
$\zeta$-matrix. 

$M$ is a positive $s \times s$ matrix with nonnegative integer entries. For every $j \in \mathcal{A}$ we have 
$\sum_{i \in \mathcal{A}} m_{ij} = \langle L(\zeta(j)), \mathbf{1} \rangle = |\zeta(j)|$, where $\langle \cdot, \cdot \rangle$ is the scalar product in $\mathbb{R}^s$ and $\mathbf{1} = (1, \dots, 1)$. For a word 
$B \in \mathcal{A}^*$ we denote by $L(B)$ the vector in $\mathbb{R}^s$ with entries $L_i(B)$ for $0 \le i \le s-1$. 
It holds that $L(\zeta(B)) = M \cdot L(B)$ and in particular $L(\zeta(j)) = (m_{ij})_{i \in \mathcal{A}}$. We call 
$L : \mathcal{A}^* \to \mathbb{R}^s$ the composition function and $M$ also the composition matrix. Remark 
that $\zeta$ is primitive if $M(\zeta)$ is primitive, i.e. $M^k$ has strictly positive entries for some $k$. The next 
proposition gives properties of primitive matrices which are crucial in the 
following treatment. 
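The relation $L(\zeta(B)) = M \cdot L(B)$ can be checked directly. The following minimal sketch assumes words are strings over the alphabet "01" and uses the Thue-Morse substitution from the example above; the helper names are illustrative choices made here.

```python
THUE_MORSE = {"0": "01", "1": "10"}
ALPHABET = "01"


def apply_substitution(word, rule):
    """zeta(b_0 ... b_n) = zeta(b_0) ... zeta(b_n)."""
    return "".join(rule[letter] for letter in word)


def composition_vector(word, alphabet):
    """L(B): number of occurrences of each letter of the alphabet in the word B."""
    return [word.count(letter) for letter in alphabet]


def substitution_matrix(rule, alphabet):
    """M with entries m_ij = L_i(zeta(j)): row index i, column index j."""
    return [[rule[j].count(i) for j in alphabet] for i in alphabet]


def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


M = substitution_matrix(THUE_MORSE, ALPHABET)
B = "0110"
lhs = composition_vector(apply_substitution(B, THUE_MORSE), ALPHABET)
rhs = mat_vec(M, composition_vector(B, ALPHABET))
print(M, lhs, rhs)
```

The column sums of $M$ equal the lengths $|\zeta(j)|$, which is the identity $\sum_i m_{ij} = |\zeta(j)|$ stated above.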

C.8 Proposition (Perron-Frobenius) Let $M$ be a primitive, positive matrix. Then it 
holds that 

(a) $M$ has a strictly positive eigenvalue $\theta$ such that $\theta > |\lambda|$ for all eigenvalues $\lambda$ of $M$ 
which are different from $\theta$. 

(b) There exists a strictly positive eigenvector for $\theta$. 

(c) $\theta$ is a simple eigenvalue. 

The dominating eigenvalue $\theta$ is also called the Perron-Frobenius eigenvalue (PF-eigenvalue). 
A positive matrix is called irreducible if for every $i, j$ an integer $k \ge 1$ exists such that 
$m_{ij}^{(k)} > 0$, where $m_{ij}^{(k)}$ denotes the $(i,j)$ entry of $M^k$. 

C.9 Remark Under the weaker assumption of an irreducible matrix one can show 
the result of Perron-Frobenius almost analogously. Only part (a) changes as follows. $M$ has a 
strictly positive eigenvalue $\theta$ such that $\theta \ge |\lambda|$ for every eigenvalue $\lambda$ of $M$ different from 
$\theta$. With the help of periodicity we can classify the eigenvalues with $\lambda \neq \theta$ and $|\lambda| = \theta$ 
exactly. 

C.10 Definition The period $d \ge 1$ of an irreducible, positive matrix $M$ is the greatest 
common divisor of the set $\{k \ge 1 : m_{ii}^{(k)} > 0\}$ for every $i$. 

In particular we have the following relation. 

C.11 Proposition An irreducible, positive matrix is primitive if and only if the period 
is $d = 1$. 

C.12 Proposition Let $M$ be an irreducible, positive matrix with period $d \ge 1$, then 
there are exactly $d$ eigenvalues $\lambda$ of $M$ with $|\lambda| = \theta$, namely $\lambda = \theta e^{2\pi i k/d}$, $k = 0, \dots, d-1$. 

The next proposition is a consequence of Perron-Frobenius and the first step towards 
unique ergodicity of the system $(X(\zeta), T)$. 

C.13 Proposition Let $\zeta$ be a primitive substitution. For every $\alpha \in \mathcal{A}$ the $s$-dimensional 
vector $\frac{L(\zeta^n(\alpha))}{\theta^n}$ converges to a strictly positive eigenvector $v(\alpha)$ for the PF-eigenvalue 
$\theta$. 
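With that we get the following result. 

C.14 Proposition For every $\alpha \in \mathcal{A}$ it holds that 

$$\lim_{n \to \infty} \frac{|\zeta^{n+1}(\alpha)|}{|\zeta^n(\alpha)|} = \theta.$$

Proof. 

Using Proposition C.13 it follows 

$$\frac{|\zeta^{n+1}(\alpha)|}{|\zeta^n(\alpha)|} = \frac{\langle L(\zeta^{n+1}(\alpha)), \mathbf{1} \rangle}{\theta^{n+1}} \cdot \frac{\theta^{n+1}}{\langle L(\zeta^n(\alpha)), \mathbf{1} \rangle} \xrightarrow{n \to \infty} \theta \, \frac{\langle v(\alpha), \mathbf{1} \rangle}{\langle v(\alpha), \mathbf{1} \rangle} = \theta.$$

□ 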
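The growth rate in Proposition C.14 can be observed numerically. The sketch below computes the lengths $|\zeta^n(\alpha)|$ through the composition vectors, using $L(\zeta(B)) = M \cdot L(B)$, instead of building the words. Besides the Thue-Morse substitution ($\theta = 2$) it uses the Fibonacci substitution $0 \to 01$, $1 \to 0$, which does not occur in the text and is introduced here only as an assumed second example with a non-integer PF-eigenvalue $\theta = (1 + \sqrt{5})/2$. All function names are illustrative.

```python
from math import sqrt

THUE_MORSE = {"0": "01", "1": "10"}
FIBONACCI = {"0": "01", "1": "0"}   # hypothetical extra example, not from the text
ALPHABET = "01"


def substitution_matrix(rule, alphabet):
    """M with entries m_ij = number of occurrences of letter i in zeta(j)."""
    return [[rule[j].count(i) for j in alphabet] for i in alphabet]


def word_lengths(rule, alphabet, letter, n_max):
    """Return [|zeta^n(letter)| for n = 0, ..., n_max] via L(zeta(B)) = M L(B)."""
    M = substitution_matrix(rule, alphabet)
    vec = [1 if a == letter else 0 for a in alphabet]
    lengths = [sum(vec)]
    for _ in range(n_max):
        vec = [sum(m * v for m, v in zip(row, vec)) for row in M]
        lengths.append(sum(vec))
    return lengths


tm_lengths = word_lengths(THUE_MORSE, ALPHABET, "0", 10)
fib_lengths = word_lengths(FIBONACCI, ALPHABET, "0", 30)
tm_ratio = tm_lengths[-1] / tm_lengths[-2]
fib_ratio = fib_lengths[-1] / fib_lengths[-2]
golden_ratio = (1 + sqrt(5)) / 2
print(tm_ratio, fib_ratio, golden_ratio)
```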



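The next proposition shows that every letter in $u$ appears with a positive frequency in 
$u$ if $\zeta$ is a primitive substitution (in particular if $\zeta$ fulfils the two conditions (C.1) and 
(C.2)). 

C.15 Proposition Let $\alpha, j \in \mathcal{A}$, then it holds that 

$$\lim_{n \to \infty} \frac{L_j(\zeta^n(\alpha))}{|\zeta^n(\alpha)|} = d_j,$$

where $d_j > 0$ is independent of $\alpha$. 

Proof. 

Using Proposition C.13 and $|\zeta^n(\alpha)| = \langle L(\zeta^n(\alpha)), \mathbf{1} \rangle$ we get 

$$\frac{L(\zeta^n(\alpha))}{|\zeta^n(\alpha)|} = \frac{L(\zeta^n(\alpha))}{\theta^n} \cdot \frac{\theta^n}{\langle L(\zeta^n(\alpha)), \mathbf{1} \rangle} \xrightarrow{n \to \infty} \frac{v(\alpha)}{\langle v(\alpha), \mathbf{1} \rangle}.$$

The limit is the strictly positive and normed eigenvector $v$ of $\theta$. Because of that $d_j = v_j > 0$ 
does not depend on $\alpha$. □ 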

We now show that the system $(X(\zeta), T)$ is uniquely ergodic. A topological dynamical 
system is called uniquely ergodic if there is a unique $T$-invariant probability measure 
$\mu$ on $X$. In order to achieve that we generalize the last result and replace primitivity with 
the more general condition of minimality. 

C.16 Proposition Assume the system $(X, T)$ is minimal. Then for every letter $\alpha$ in the 
fixed point $u$ and every word $B$ in $u$ it holds that 

$$\lim_{n \to \infty} \frac{L_B(\zeta^n(\alpha))}{|\zeta^n(\alpha)|} = d_B,$$

where $d_B > 0$ does not depend on $\alpha$. 
Proof. 

Let $B$ be a word in $u$ of length $l \ge 1$. For $l = 1$ the claim follows with Proposition C.15. 
W.l.o.g. we assume $l \ge 2$ and define 

$$\Omega_l := \{B \in u : |B| = l\},$$

the set of all words of length $l$ which appear in $u$. 
We prove the claim by using $\Omega_l$ as a new alphabet and defining a corresponding 
substitution $\zeta_l$ such that we can apply Proposition C.15. 

Let $\omega = \omega_0 \dots \omega_{l-1}$ be a letter in the new alphabet $\Omega_l$. We define $\zeta_l : \Omega_l \to \Omega_l^*$ with the notation 

$$\zeta(\omega) = \zeta(\omega_0) \cdots \zeta(\omega_{l-1}) = y_0 \cdots y_{|\zeta(\omega_0)|-1} \, y_{|\zeta(\omega_0)|} \cdots y_{|\zeta(\omega)|-1},$$

where $y_i \in \mathcal{A}$, as 

$$\zeta_l(\omega) := (y_0 \cdots y_{l-1})(y_1 \cdots y_l) \cdots (y_{|\zeta(\omega_0)|-1} \cdots y_{|\zeta(\omega_0)|+l-2}).$$

We can extend $\zeta_l$ via concatenation of symbols to $\Omega_l^*$ and $\Omega_l^{\mathbb{N}}$. We now show the following 
two properties of $\zeta_l$. 

(a) $\zeta_l$ has a fixed point $u_l \in \Omega_l^{\mathbb{N}}$ with $u_l = (u_0 \dots u_{l-1})(u_1 \dots u_l)(u_2 \dots u_{l+1}) \dots$, 

(b) $\zeta_l$ is primitive if $\zeta$ is primitive. 

Proof of (a). 

Let $\omega = u_0 \dots u_{l-1}$. With $\zeta(\omega) = u_0 \dots u_{|\zeta(\omega)|-1}$, which holds because of $\zeta(u) = u$, we get 

$$\zeta_l(\omega) = (u_0 \dots u_{l-1})(u_1 \dots u_l) \dots (u_{|\zeta(u_0)|-1} \dots u_{|\zeta(u_0)|+l-2}).$$

The word $\zeta_l(\omega)$ starts with $\omega$, and with Proposition C.2 the existence of a fixed point 
follows. For every $n \ge 1$ we have 

$$\zeta_l^n(\omega) = (u_0 \dots u_{l-1})(u_1 \dots u_l) \dots (u_{|\zeta^n(u_0)|-1} \dots u_{|\zeta^n(u_0)|+l-2}),$$

such that $\zeta_l^\infty(\omega) = u_l$. 

Proof of (b). 

With (a), $\zeta_l$ fulfils the conditions (C.1) and (C.2) (use $u_0 \dots u_{l-1}$ instead of 0). Because 
of that it is enough to show irreducibility of $\zeta_l$ on $\Omega_l$. Let $\omega, B \in \Omega_l$. Since $u = \zeta^n(u)$ for 
every $n$, there is an $\alpha \in \mathcal{A}$ and $p \ge 1$ such that $B \subset \zeta^p(\alpha)$. Because $\zeta$ is primitive it holds 
that $\alpha \in \zeta^m(\omega_0)$ for $m \ge m_0$ and we get $B \subset \zeta^{m+p}(\omega_0)$. With the notation 

$$\zeta^n(\omega) = \zeta^n(\omega_0) \cdot \zeta^n(\omega_1 \dots \omega_{l-1}) = y_0 y_1 \cdots y_{|\zeta^n(\omega_0)|-1} \, a_0 a_1 \cdots$$

we obtain 

$$\zeta_l^n(\omega) = (y_0 \cdots y_{l-1})(y_1 \cdots y_l) \cdots (y_{|\zeta^n(\omega_0)|-1} a_0 \cdots a_{l-2}). \tag{C.3}$$

$\zeta_l^n(\omega)$ contains all words of length $l$ which appear in $\zeta^n(\omega_0)$. Choose $m$ big enough and 
define $n := m + p$, then $\zeta_l^n(\omega)$ contains the word $B$ and the claim is proven. 
We now apply Proposition C.15 to $\zeta_l$ and obtain 

$$\lim_{n \to \infty} \frac{L_B(\zeta_l^n(\omega))}{|\zeta_l^n(\omega)|} = d_B,$$

where $d_B$ does not depend on $\omega$. Obviously we have with (C.3) that $|\zeta_l^n(\omega)| = |\zeta^n(\omega_0)|$ 
and $L_B(\zeta_l^n(\omega)) \sim L_B(\zeta^n(\omega_0))$ for $n \to \infty$. Hence we get 

$$\lim_{n \to \infty} \frac{L_B(\zeta^n(\omega_0))}{|\zeta^n(\omega_0)|} = d_B > 0.$$

□ 

The value of $d_B$ is the frequency of the word $B$ in the fixed point $u$ and is the $B$-entry of 
the normed eigenvector of the composition matrix for $\zeta_l$. 

C.17 Example As an example we consider the Thue-Morse sequence with 
$M(\zeta) = M = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ and eigenvalues $\theta = 2$, $\lambda = 0$. Define $\zeta_2$ on the alphabet $\Omega_2 = 
\{(00), (01), (10), (11)\}$ like in the proof above: 

$$\zeta_2((00)) := (01)(10),$$
$$\zeta_2((01)) := (01)(11),$$
$$\zeta_2((10)) := (10)(00),$$
$$\zeta_2((11)) := (10)(01).$$

The composition matrix $M_2$ for $\zeta_2$ is 

$$M_2 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$

with eigenvalues $\theta = 2$, $\lambda = 0, 1, -1$. The normed eigenvector for the eigenvalue $\theta$ is 
$v = (\frac{1}{6}, \frac{1}{3}, \frac{1}{3}, \frac{1}{6})$, such that the frequencies of the pairs in $u$ are as follows: 

$$d_{(00)} = \frac{1}{6} = d_{(11)}, \quad d_{(01)} = \frac{1}{3} = d_{(10)}.$$
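The example can be reproduced computationally. The following minimal sketch, with illustrative function names chosen here, builds $\zeta_2$ from its definition (the length-$l$ windows of $\zeta(\omega)$ that start inside $\zeta(\omega_0)$), assembles $M_2$, checks that $(1, 2, 2, 1)$ is an eigenvector for the eigenvalue 2, and compares the normed eigenvector with pair frequencies counted in a finite prefix of $u$. The prefix length $2^{12}$ is an arbitrary choice.

```python
THUE_MORSE = {"0": "01", "1": "10"}


def apply_substitution(word, rule):
    """zeta(b_0 ... b_n) = zeta(b_0) ... zeta(b_n)."""
    return "".join(rule[letter] for letter in word)


def block_substitution(omega, rule):
    """zeta_l(omega): the windows of length l = |omega| in zeta(omega)
    whose first letter lies inside zeta(omega[0])."""
    l = len(omega)
    image = apply_substitution(omega, rule)
    return [image[k:k + l] for k in range(len(rule[omega[0]]))]


pairs = ["00", "01", "10", "11"]
# Entry (row, col) counts how often the pair `row` occurs in zeta_2(col).
M2 = [[block_substitution(col, THUE_MORSE).count(row) for col in pairs]
      for row in pairs]

v = [1, 2, 2, 1]
M2_v = [sum(m * x for m, x in zip(row, v)) for row in M2]

# Empirical pair frequencies in the prefix zeta^12(0) of the fixed point u.
prefix = "0"
for _ in range(12):
    prefix = apply_substitution(prefix, THUE_MORSE)
windows = [prefix[k:k + 2] for k in range(len(prefix) - 1)]
empirical = [windows.count(pair) / len(windows) for pair in pairs]

print(M2)
print(M2_v)
print(empirical)
```

The empirical values agree with $(\frac{1}{6}, \frac{1}{3}, \frac{1}{3}, \frac{1}{6})$ only approximately, since they come from a finite prefix.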

Let $B$ be a word in $u$, then $[B]$ denotes the cylinder set which is generated by $B$: 

$$[B] := \{x \in X : x_j = b_j, \ 0 \le j \le |B| - 1\}, \quad \text{with } B = b_0 \dots b_{|B|-1}.$$

Let $\mu$ be a $T$-invariant probability measure on $X$, then we can write $\mu$ as 

$$\mu([B]) = \lim_{j \to \infty} \frac{1}{N_j} \left| \{n < N_j : u_n \dots u_{n+|B|-1} = B\} \right|,$$

for a sequence $(N_j)$ and every cylinder set $[B]$. In particular, for $N_j = |\zeta^j(\alpha)|$, $\mu$ is a 
$T$-invariant probability measure and it holds that $\mu([B]) = d_B$. The next proposition tells us 
that this measure is also uniquely determined. 
C.18 Proposition If the system $(X, T)$ is minimal, then it is uniquely ergodic. 
In particular we have the following. 

C.19 Corollary Every vector $\mu_l = (\mu([B]))_{B \in \Omega_l}$ is the normed eigenvector to the 
PF-eigenvalue $\theta$. 

In the next step we want to investigate the composition matrix $M_l$ of $\zeta_l : \Omega_l \to \Omega_l^*$ and 
will derive an effective method to calculate the frequencies of factors in $u$. 

C.20 Proposition Let $M = M(\zeta)$ be a primitive matrix with PF-eigenvalue $\theta$, then $M_l$ 
is a primitive matrix with the same PF-eigenvalue for every $l \ge 2$. 

We now show that we can derive the distribution of every word in $\Omega_l$ from the distribution 
of the words in $\Omega_2$. For that we fix $l \ge 2$. Remark that for $p \ge 1$ it holds that $(\zeta_l)^p = (\zeta^p)_l$, 
where $(\zeta^p)_l : \Omega_l \to \Omega_l^*$ with $\omega = \omega_0 \dots \omega_{l-1} \in \Omega_l$ and 

$$\zeta^p(\omega) = \zeta^p(\omega_0) \cdots \zeta^p(\omega_{l-1}) = y_0 \cdots y_{|\zeta^p(\omega)|-1},$$

is defined as follows: 

$$(\zeta^p)_l(\omega) := (y_0 \cdots y_{l-1})(y_1 \cdots y_l) \cdots (y_{|\zeta^p(\omega_0)|-1} \cdots y_{|\zeta^p(\omega_0)|+l-2}).$$

If $p$ is large enough compared to $l$ that the condition 

$$|\zeta^p(\omega_0)| + l - 2 < |\zeta^p(\omega_0)| + |\zeta^p(\omega_1)| \tag{C.4}$$

is fulfilled, then $(\zeta^p)_l(\omega)$ is completely determined through the knowledge of the first two 
letters $\omega_0 \omega_1$ of $\omega \in \Omega_l$. Proposition C.13 gives $|\zeta^p(\omega_1)| \sim \theta^p \|v(\omega_1)\|$ as $p \to \infty$. So 
we can express condition (C.4) with 

$$p > C \cdot l,$$

where $C > 0$ is a constant. We now fix $p$ and $l$ such that condition (C.4) is fulfilled. Let 
$\pi_2 : \Omega_l \to \Omega_2$ be the projection on the first two letters, that means $\pi_2(\omega_0 \cdots \omega_{l-1}) = \omega_0 \omega_1$. 
We define $\tau_{2,l,p} : \Omega_2 \to \Omega_l^*$ with 

$$\tau_{2,l,p}(\omega_0 \omega_1) := (y_0 \dots y_{l-1})(y_1 \dots y_l) \dots (y_{|\zeta^p(\omega_0)|-1} \cdots y_{|\zeta^p(\omega_0)|+l-2}),$$

if $\omega_0 \omega_1 \in \Omega_2$ and $\zeta^p(\omega_0 \omega_1) = y_0 \cdots y_{|\zeta^p(\omega_0)|-1} \, y_{|\zeta^p(\omega_0)|} \cdots y_{|\zeta^p(\omega_0 \omega_1)|-1}$. Obviously it holds 
that 

$$\tau_{2,l,p} \circ \pi_2 = \zeta_l^p, \quad \pi_2 \circ \tau_{2,l,p} = \zeta_2^p,$$

and 

$$\zeta_l \circ \tau_{2,l,p} = \tau_{2,l,p} \circ \zeta_2.$$

We can extend the mappings $\tau_{2,l,p}$ and $\pi_2$ in a natural way to mappings $\tau_{2,l,p} : \Omega_2^* \to \Omega_l^*$ 
and $\pi_2 : \Omega_l^* \to \Omega_2^*$ and get the following commutative diagram. 
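[Commutative diagram of the mappings $\pi_2$, $\tau_{2,l,p}$ and $\zeta_l^p$ on $\Omega_l^*$ and $\Omega_2^*$.] 

We write $L_l : \Omega_l^* \to \mathbb{R}^{|\Omega_l|}$ for the composition function of words $\omega$ in $\Omega_l^*$, then we get 

$$L_l(\zeta_l(\omega)) = M_l L_l(\omega), \quad \text{with } \omega \in \Omega_l^*,$$

and 

$$M_{2,l,p} L_2(\pi_2(\omega)) = L_l(\zeta_l^p(\omega)),$$

where $M_{2,l,p}$ is the composition matrix of $\tau_{2,l,p}$. With $\Lambda$ as a matrix for the projection $\pi_2$ 
and $\xi = \pi_2(\omega)$ we get the following commutative diagram. 

[Commutative diagram of the corresponding composition vectors and matrices.] 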




C.21 Corollary The eigenvalues of $M_l$ coincide with the eigenvalues of $M_2$, if they are 
not equal to zero. 

Proof. 

Because of $M_{2,l,p} \cdot M_2 = M_l \cdot M_{2,l,p}$ it holds for every polynomial $Q$ that 

$$M_{2,l,p} \cdot Q(M_2) = Q(M_l) \cdot M_{2,l,p}.$$

On the other hand we have $M_{2,l,p} \cdot Q(M_2) \Lambda = Q(M_l) \cdot M_l^p$, such that for $Q(M_2) = 0$ the 
polynomial $X \mapsto Q(X) \cdot X^p$ annihilates the matrix $M_l$. Furthermore $M_2^p Q(M_2) = 
\Lambda Q(M_l) M_{2,l,p}$ implies that $X \mapsto Q(X) \cdot X^p$ annihilates the matrix $M_2$ if $Q(M_l) = 0$. 

□ 

C.22 Corollary If $v_2$ is an eigenvector of $M_2$ for the eigenvalue $\theta$, then $M_{2,l,p} \cdot v_2$ is an 
eigenvector of $M_l$ for the eigenvalue $\theta$. 

Proof. 

The claim follows from the fact that $M_{2,l,p} \cdot M_2 = M_l \cdot M_{2,l,p}$. □ 

For determining the frequency of a word $\omega \in \Omega_l$ of length $l$ it is enough to determine the 
frequency of every pair $(\alpha\beta)$. Count how often $\omega$ appears in $\zeta^p(\alpha\beta)$ under the condition 
that the first letter of $\omega$ is in $\zeta^p(\alpha)$. This is then the entry of $M_{2,l,p}$ at the position 
$(\omega, (\alpha\beta)) \in \Omega_l \times \Omega_2$. If one considers for example the Thue-Morse sequence and wants to 
calculate the frequencies of words of length 5, one can set $p = 3$, so that condition 
(C.4) is fulfilled, and gets 

$$\zeta^3(00) = 0110.1001.0110.1001,$$
$$\zeta^3(01) = 0110.1001.1001.0110,$$
$$\zeta^3(10) = 1001.0110.0110.1001,$$
$$\zeta^3(11) = 1001.0110.1001.0110.$$

There are 12 words of length 5 in $u$: 

(00101) (00110) (01001) (01011) (01100) (01101) 

(11010) (11001) (10110) (10100) (10011) (10010). 
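With the rows ordered like the words above and the columns ordered as $(00), (01), (10), (11)$, 
the $M_{2,l,p}$-matrix has the following form: 

$$M_{2,5,3} = \begin{pmatrix}
1 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
1 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 \\
1 & 0 & 1 & 1
\end{pmatrix}$$

Because of $v_2 = (1, 2, 2, 1)$ we get $v_5 = M_{2,l,p} \cdot v_2 = (4, \dots, 4)$. Therefore every word has 
the frequency $\frac{1}{12}$. Analogously we can calculate the frequencies of words of arbitrary 
length. 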
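The calculation for words of length 5 can be reproduced with a short script. This is a minimal sketch with illustrative names (`tau`, `M_2_5_3`): it builds $\tau_{2,l,p}$ for $l = 5$, $p = 3$ from its definition, collects the words of length 5, assembles the matrix and applies it to $v_2 = (1, 2, 2, 1)$. The rows are ordered lexicographically here, which differs from the ordering of the word list above but does not affect the result.

```python
THUE_MORSE = {"0": "01", "1": "10"}
P, L = 3, 5   # p = 3 iterations, words of length l = 5


def iterate(word, rule, n):
    """Return zeta^n(word)."""
    for _ in range(n):
        word = "".join(rule[letter] for letter in word)
    return word


def tau(pair):
    """tau_{2,l,p}(pair): the windows of length L in zeta^P(pair)
    whose first letter lies inside zeta^P(pair[0])."""
    image = iterate(pair, THUE_MORSE, P)
    n_starts = len(iterate(pair[0], THUE_MORSE, P))
    return [image[k:k + L] for k in range(n_starts)]


pairs = ["00", "01", "10", "11"]
words = sorted({w for pair in pairs for w in tau(pair)})

# Entry (word, pair): number of occurrences of `word` in tau(pair).
M_2_5_3 = [[tau(pair).count(word) for pair in pairs] for word in words]

v2 = [1, 2, 2, 1]
v5 = [sum(m * x for m, x in zip(row, v2)) for row in M_2_5_3]
frequencies = [x / sum(v5) for x in v5]

print(len(words), v5)
```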